Abstract

In this work we find the capacity of a compound finite-state channel with time-invariant deterministic
feedback. The model we consider involves the use of fixed length block codes. Our achievability result
includes a proof of the existence of a universal decoder for the family of finite-state channels with
feedback. As a consequence of our capacity result, we show that feedback does not increase the capacity
of the compound Gilbert-Elliot channel. Additionally, we show that for a stationary and uniformly ergodic
Markovian channel, if the compound channel capacity is zero without feedback then it is zero with
feedback. Finally, we use our result on the finite-state channel to show that the feedback capacity of the
memoryless compound channel is given by $\inf_{\theta} \max_{Q_X} I(X;Y|\theta)$.



Index Terms

compound channel, feedback capacity, finite state channel, directed information, causal conditioning
probability, Gilbert-Elliot channel, universal decoder, code-trees, types of code-trees, Sanov's theorem,
Pinsker's inequality

I. Introduction 

The compound channel consists of a set of channels indexed by $\theta \in \Theta$ with the same input and output
alphabets but different conditional probabilities. In the setting of the compound channel only one actual
channel is used in all transmissions. The transmitter and the receiver know the family of channels but
they have no prior knowledge of which channel is actually used. There is no distribution law on the
family of channels and the communication has to be reliable for all channels in the family.

Blackwell et al. [1] and independently Wolfowitz [2] showed that the capacity of a compound channel 
consisting of memoryless channels only, and without feedback, is given by 

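\[ \max_{Q_X} \inf_{\theta} \mathcal{I}(Q_X; P_{Y|X,\theta}), \tag{1} \]

where $Q_X(\cdot)$ denotes the input distribution to the channel, $P_{Y|X,\theta}(\cdot|\cdot,\theta)$ denotes the conditional
probability of a memoryless channel indexed by $\theta$, and the notation $\mathcal{I}(Q_X; P_{Y|X,\theta})$ denotes the mutual
information of channel $P_{Y|X,\theta}$ for the input distribution $Q_X$, i.e.,

\[ \mathcal{I}(Q_X; P_{Y|X,\theta}) \triangleq \sum_{x,y} Q_X(x) P_{Y|X,\theta}(y|x,\theta) \ln \frac{P_{Y|X,\theta}(y|x,\theta)}{\sum_{x'} Q_X(x') P_{Y|X,\theta}(y|x',\theta)}. \tag{2} \]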
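As a numerical illustration of (1) and (2), the following minimal sketch evaluates the max-min expression by grid search for a two-member family with binary input. The family (a Z-channel and its mirror image, both with crossover probability 0.5), the grid resolution, and all function names are assumptions made for this example and do not come from the paper.

```python
import math


def mutual_information(q, channel):
    """I(Q_X; P_{Y|X}) in nats, as in (2); q[x] is the input pmf, channel[x][y] = P(y|x)."""
    num_outputs = len(channel[0])
    # Output marginal: sum over x' of Q(x') P(y|x').
    p_y = [sum(q[x] * channel[x][y] for x in range(len(q))) for y in range(num_outputs)]
    total = 0.0
    for x in range(len(q)):
        for y in range(num_outputs):
            joint = q[x] * channel[x][y]
            if joint > 0.0:
                total += joint * math.log(channel[x][y] / p_y[y])
    return total


def z_channel(p):
    """Hypothetical family member: input 0 is noiseless, input 1 flips with probability p."""
    return [[1.0, 0.0], [p, 1.0 - p]]


def s_channel(p):
    """Mirror image of the Z-channel: input 1 is noiseless, input 0 flips with probability p."""
    return [[1.0 - p, p], [0.0, 1.0]]


def compound_capacity(family, grid=2001):
    """Grid search over binary input pmfs for max_Q min_theta I(Q; P_theta), as in (1)."""
    best_value, best_q = -1.0, None
    for k in range(grid):
        a = k / (grid - 1)
        q = [1.0 - a, a]
        worst = min(mutual_information(q, ch) for ch in family)
        if worst > best_value:
            best_value, best_q = worst, q
    return best_value, best_q


def single_capacity(channel, grid=2001):
    """Capacity of one binary-input channel, using the same grid search."""
    return compound_capacity([channel], grid)[0]


if __name__ == "__main__":
    family = [z_channel(0.5), s_channel(0.5)]
    value, q = compound_capacity(family)
    print("compound capacity (nats):", value, "at input pmf", q)
    print("individual capacities (nats):", [single_capacity(ch) for ch in family])
```

For this family the two channels have different capacity-achieving input distributions, so the single input distribution that the compound code must use yields a max-min value strictly below the capacity of either channel.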



February 2, 2008 



DRAFT 



SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY, 2007. 



1 



The capacity in (1) is in general less than the capacity of every channel in the family. Wolfowitz, who
coined the term "compound channel," showed that if the transmitter knows the channel $\theta$ in use, then
the capacity is given by [3, chapter 4]

\[ \inf_{\theta} \max_{Q_X} \mathcal{I}(Q_X; P_{Y|X,\theta}) = \inf_{\theta} C_\theta, \tag{3} \]

where $C_\theta$ is the capacity of the channel indexed by $\theta$. This shows that knowledge at the transmitter of
the channel $\theta$ in use helps in that the infimum of the capacities of the channels in the family can now
be achieved. In the case that $\Theta$ is a finite set, it follows from Wolfowitz's result that $\min_\theta C_\theta$ is
the feedback capacity of the memoryless compound channel, since the transmitter can use a training
sequence together with the feedback to estimate $\theta$ with high probability. In this paper we show that
when $\Theta$ is not limited to finite cardinality, the feedback capacity of the memoryless compound channel
is given by $\inf_\theta C_\theta$.$^{1}$ One might be tempted to think that for a compound channel with memory, feedback
provides a means to achieve the infimum of the capacities of the channels in the family. However, this
is not necessarily true, as we show in Example 1, which is taken from [4] and applied to the compound
Gilbert-Elliot channel with feedback. That example is found in Section V.

A comprehensive review of the compound channel and its role in communication is given by Lapidoth 
and Narayan [5]. Of specific interest in this paper are compound channels with memory which are often 
used to model wireless communication in the presence of fading [6]-[8]. Lapidoth and Telatar [4] derived 
the following formula for the compound channel capacity of the class of finite state channels (FSC) when
there is no feedback available at the transmitter:

\[ \lim_{n\to\infty} \max_{Q_{X^n}} \inf_{s_0,\theta} \frac{1}{n} \mathcal{I}(Q_{X^n}; P_{Y^n|X^n,s_0,\theta}), \tag{4} \]

where $s_0$ denotes the initial state of the FSC, and $Q_{X^n}(\cdot)$ and $P_{Y^n|X^n,s_0,\theta}(\cdot|\cdot,s_0,\theta)$ denote the input
distribution and channel conditional probability for block length $n$. Lapidoth and Telatar's achievability
result makes use of a universal decoder for the family of finite-state channels. The existence of the
universal decoder is proved by Feder and Lapidoth in [9] by merging a finite number of maximum-likelihood
decoders, each tuned to a channel in the family $\Theta$.

Throughout this paper we use the concepts of causal conditioning and directed information which were 
introduced by Massey in [10]. Kramer extended those concepts and used them in [11] to characterize the 
capacity of discrete memoryless networks. Subsequently, three different proofs - Tatikonda and Mitter 
[12], [13], Permuter, Weissman and Goldsmith [14] and Kim [15] - have shown that directed information 
and causal conditioning are useful in characterizing the feedback capacity of a point-to-point channel
with memory. In particular, this work uses results from [14] that show that Gallager's [6, ch. 4,5] upper
and lower bounds on the capacity of an FSC can be generalized to the case that there is time-invariant
deterministic feedback, $z_{i-1} = f(y_{i-1})$, available at the encoder at time $i$.

In this paper we extend Lapidoth and Telatar's work for the case that there is deterministic time-invariant 
feedback available at the encoder by replacing the regular conditioning with the causal conditioning. Then 
we use the feedback capacity theorem to study the compound Gilbert-Elliot channel and the memoryless 
compound channel and to specify a class of compound channels for which the capacity is zero if and only 
if the feedback capacity is zero. The proof of the feedback capacity of the FSC is found in Section III,
which describes the converse result, and Section IV, where we prove achievability. As a consequence of
the capacity result, we show in Section V that feedback does not increase the capacity of the compound
Gilbert-Elliot channel. We next show in Section VI that for a family of stationary and uniformly ergodic
Markovian channels, the capacity of the compound channel is positive if and only if the feedback capacity
of the compound channel is positive. Finally, we return to the memoryless compound channel in Section
VII and make use of our capacity result to provide a proof of the feedback capacity.

The notation we use throughout is as follows. A capital letter $X$ denotes a random variable and a
lower-case letter, $x$, denotes a realization of the random variable. Vectors are denoted using subscripts
and superscripts, $x^n = (x_1, \ldots, x_n)$ and $x_i^n = (x_i, \ldots, x_n)$. We deal with discrete random variables
where a probability mass function on the channel input is denoted $Q_{X^n}(x^n) = \Pr(X^n = x^n)$ and
$P_{Y^n|X^n,\theta}(y^n|x^n,\theta) = \Pr(Y^n = y^n | X^n = x^n, \theta)$ denotes a mass function on the channel output. When
no confusion can result, we will omit subscripts from the probability functions, i.e., $Q(x_i|x^{i-1}, y^{i-1})$
will denote $Q_{X_i|X^{i-1},Y^{i-1}}(x_i|x^{i-1}, y^{i-1})$.

II. Problem statement and main result 

The problem we consider is depicted in Figure 1. A message $W$ from the set $\{1, 2, \ldots, e^{nR}\}$ is to be
transmitted over a compound finite state channel with time-invariant deterministic feedback. The family
$\Theta$ of finite state channels has a common state space $\mathcal{S}$ and common finite input and output alphabets
given by $\mathcal{X}$ and $\mathcal{Y}$. For a given channel $\theta \in \Theta$ the channel output at time $i$ is characterized by the
conditional probability

\[ P(y_i, s_i | x_i, s_{i-1}, \theta), \quad y_i \in \mathcal{Y},\ x_i \in \mathcal{X},\ s_i, s_{i-1} \in \mathcal{S}, \tag{5} \]

which satisfies the condition $P(y_i, s_i | x^i, s^{i-1}, y^{i-1}, \theta) = P(y_i, s_i | x_i, s_{i-1}, \theta)$. The channel $\theta$ is in use
over the sequence of $n$ channel inputs. The family of channels is known to both the encoder and
decoder; however, they do not have knowledge of the channel $\theta$ in use before transmission begins.

$^{1}$Although Wolfowitz mentions the feedback problem in discussing the memoryless compound channel [3, ch. 4], to the best
of our knowledge, this result has not been proved in any previous work.

Fig. 1. Compound finite state channel with feedback that is a time-invariant deterministic function of the channel output.

The message $W$ is encoded such that at time $i$ the codeword symbol $X_i$ is a function of $W$ and the
feedback sequence $Z^{i-1}$. For notational convenience, we will refer to the input sequence $X^i(W, Z^{i-1})$
as simply $X^i$. The feedback sequence is a time-invariant deterministic function of the output $Y_i$ and is
available at the encoder with a single time unit delay. The function performed on the channel output $Y_i$
to form the feedback $Z_i$ is known to both the transmitter and receiver before communication begins. The
decoder operates over the sequence of channel outputs $Y^n$ to form the message estimate $\hat{W}$.

For a given initial state $s_0 \in \mathcal{S}$ and channel $\theta \in \Theta$, the channel causal conditioning distribution is
given by

\[ P(y^n \| x^n, s_0, \theta) \triangleq \prod_{i=1}^{n} P(y_i | x^i, y^{i-1}, s_0, \theta). \tag{6} \]
Additionally we will make use of Massey's directed information [10]. When conditioned on the initial
state and channel, the directed information is given by

\[ I(X^n \to Y^n | s_0, \theta) = \sum_{i=1}^{n} I(Y_i; X^i | Y^{i-1}, s_0, \theta). \tag{7} \]



Our capacity result will involve a maximization of the directed information over the input distribution
$Q(x^n \| z^{n-1})$, which is defined as

\[ Q(x^n \| z^{n-1}) \triangleq \prod_{i=1}^{n} Q(x_i | x^{i-1}, z^{i-1}). \tag{8} \]
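The definitions of causal conditioning and directed information can be checked by brute force on a toy case. The sketch below assumes a memoryless binary symmetric channel (an FSC with a single state), feedback $z_i = y_i$, and blocklength $n = 2$; it builds the joint distribution as the product of the causally conditioned input and channel distributions and then evaluates the sum of conditional mutual informations in the definition of directed information. The channel, the two input policies, and all function names are assumptions chosen for illustration.

```python
import itertools
import math
from collections import defaultdict


def build_joint(n, policy, channel):
    """Joint pmf of (x^n, y^n) as the product of Q(x_i | x^{i-1}, z^{i-1}) and P(y_i | x_i).

    policy(x_prev, z_prev) returns a pmf over inputs; the feedback is z_i = y_i.
    """
    joint = {}
    inputs = range(len(channel))
    outputs = range(len(channel[0]))
    for x in itertools.product(inputs, repeat=n):
        for y in itertools.product(outputs, repeat=n):
            p = 1.0
            for i in range(n):
                p *= policy(x[:i], y[:i])[x[i]] * channel[x[i]][y[i]]
            if p > 0.0:
                joint[(x, y)] = p
    return joint


def marginal(joint, x_len, y_len):
    """Marginal pmf of the prefixes (x^{x_len}, y^{y_len})."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[(x[:x_len], y[:y_len])] += p
    return out


def directed_information(joint, n):
    """Sum over i of I(Y_i; X^i | Y^{i-1}), in nats."""
    total = 0.0
    for i in range(1, n + 1):
        p_xi_yi = marginal(joint, i, i)
        p_xi_yprev = marginal(joint, i, i - 1)
        p_yi = marginal(joint, 0, i)
        p_yprev = marginal(joint, 0, i - 1)
        for (x, y), p in p_xi_yi.items():
            if p > 0.0:
                numerator = p * p_yprev[((), y[:-1])]
                denominator = p_xi_yprev[(x, y[:-1])] * p_yi[((), y)]
                total += p * math.log(numerator / denominator)
    return total


def iid_uniform_policy(x_prev, z_prev):
    """Ignores the feedback: every input symbol is uniform on {0, 1}."""
    return [0.5, 0.5]


def copy_feedback_policy(x_prev, z_prev):
    """Uniform first symbol; afterwards the input repeats the last feedback symbol."""
    if not z_prev:
        return [0.5, 0.5]
    return [1.0, 0.0] if z_prev[-1] == 0 else [0.0, 1.0]


if __name__ == "__main__":
    p = 0.1
    bsc = [[1 - p, p], [p, 1 - p]]
    for policy in (iid_uniform_policy, copy_feedback_policy):
        joint = build_joint(2, policy, bsc)
        print(policy.__name__, directed_information(joint, 2))
```

With the feedback-copying policy the second input carries no new information about the message, and the directed information correctly counts only the first channel use, whereas the ordinary mutual information between $X^2$ and $Y^2$ would be larger.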



We make use of some of the properties provided in [10], [14] in our work, including the following three 
which we restate for our problem setting. 

1) $P(x^n, y^n | s_0, \theta) = Q(x^n \| y^{n-1}) P(y^n \| x^n, s_0, \theta)$ [10, eq. (3)], [14, Lemma 1].

2) $|I(X^n \to Y^n | \theta) - I(X^n \to Y^n | S, \theta)| \le \log |\mathcal{S}|$, where the random variable $S$ denotes the state of the
finite-state channel [14, Lemma 5].

3) From [14, Lemma 6],

\[ I(X^n \to Y^n | s_0, \theta) = \mathcal{I}(Q_{X^n \| Y^{n-1}}; P_{Y^n \| X^n, s_0, \theta}). \]

Note that properties 1 and 3 hold since $Q(x^n \| y^{n-1}, s_0, \theta) = Q(x^n \| y^{n-1})$ for our feedback setting,
where it is assumed that the state $s_0$ is not available at the encoder.

For a given initial state $s_0$ and channel $\theta$ the average probability of error in decoding message $w$ is
given by

\[ P_{e,w}(s_0, \theta) = \sum_{y^n : \hat{w}(y^n) \neq w} P(y^n \| x^n, s_0, \theta), \]

where $x^n$ is a function of the message $w$ and of the feedback $z^{n-1}$. The average (over messages) error
probability is denoted $P_e(s_0, \theta)$, where $P_e(s_0, \theta) = \frac{1}{e^{nR}} \sum_{w} P_{e,w}(s_0, \theta)$. We say that a rate $R$ is
achievable for the compound channel with feedback as shown in Figure 1 if for any $\epsilon > 0$ there exists a
code of fixed blocklength $n$ and rate $R$, i.e. an $(n, e^{nR})$ code, such that $P_e(s_0, \theta) < \epsilon$ for all $\theta \in \Theta$ and $s_0 \in \mathcal{S}$.
Equivalently, rate $R$ is achievable if there exists a sequence of rate-$R$ codes such that

\[ \lim_{n\to\infty} \sup_{s_0,\theta} P_e(s_0, \theta) = 0. \tag{9} \]

This definition of achievable rate is identical to that given in previous work on the compound channel 
without feedback. A different definition for the compound channel with feedback could also be consid- 
ered; for instance, in [16], the authors consider codes of variable blocklength and define achievability 
accordingly. 

The capacity is defined as the supremum over all achievable rates and is given in the following theorem. 

Theorem 1: The feedback capacity of the compound finite state channel is given by

\[ C = \lim_{n\to\infty} \max_{Q(x^n \| z^{n-1})} \inf_{s_0,\theta} \frac{1}{n} I(X^n \to Y^n | s_0, \theta). \tag{10} \]

Theorem 1 is proved in Section III, which shows the existence of $C$ and proves the converse, and Section
IV, where achievability is established.

III. Existence of C and the converse 

We first state the following proposition, which shows that the capacity $C$ as defined in Theorem 1
exists. The proof is found in Appendix I.

Proposition 1: Let

\[ C_n = \max_{Q(x^n \| z^{n-1})} \inf_{s_0,\theta} \frac{1}{n} I(X^n \to Y^n | s_0, \theta). \tag{11} \]

Then $C_n$ is well defined and converges for $n \to \infty$. In addition, let

\[ \underline{C}_n = C_n - \frac{\log |\mathcal{S}|}{n}. \tag{12} \]

Then

\[ \lim_{n\to\infty} C_n = \sup_{n} \underline{C}_n. \tag{13} \]

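To prove the converse in Theorem 1, we assume a uniform distribution on the message set, for which
$H(W) = nR$. Since the message is independent of the channel parameters, $H(W) = H(W|s_0,\theta)$ and
we apply Fano's inequality as follows.

\begin{align*}
nR &= H(W|s_0,\theta) \\
&= I(Y^n; W|s_0,\theta) + H(W|Y^n, s_0,\theta) \\
&\le I(Y^n; W|s_0,\theta) + P_e(s_0,\theta) nR + 1 \\
&= H(Y^n|s_0,\theta) - H(Y^n|W, s_0,\theta) + P_e(s_0,\theta) nR + 1 \\
&= \sum_{i=1}^{n} H(Y_i|Y^{i-1}, s_0,\theta) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W, s_0,\theta) + P_e(s_0,\theta) nR + 1 \\
&= \sum_{i=1}^{n} H(Y_i|Y^{i-1}, s_0,\theta) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W, X^i(W, Z^{i-1}(Y^{i-1})), s_0,\theta) + P_e(s_0,\theta) nR + 1 \\
&= \sum_{i=1}^{n} H(Y_i|Y^{i-1}, s_0,\theta) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, X^i, s_0,\theta) + P_e(s_0,\theta) nR + 1 \\
&= \sum_{i=1}^{n} I(Y_i; X^i|Y^{i-1}, s_0,\theta) + P_e(s_0,\theta) nR + 1 \\
&= I(X^n \to Y^n|s_0,\theta) + P_e(s_0,\theta) nR + 1
\end{align*}

For any code we have

\[ I(X^n \to Y^n | s_0, \theta) \ge nR(1 - P_e(s_0, \theta)) - 1 \tag{14} \]

and therefore

\[ \inf_{s_0,\theta} I(X^n \to Y^n | s_0, \theta) \ge nR\left(1 - \sup_{s_0,\theta} P_e(s_0, \theta)\right) - 1. \tag{15} \]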
By combining the above statement with Proposition 1 we have

\[ C \ge \underline{C}_n \ge R\left(1 - \sup_{s_0,\theta} P_e(s_0, \theta)\right) - \frac{1}{n} - \frac{\log |\mathcal{S}|}{n}. \tag{16} \]

Then for a sequence of codes of rate $R$ with $\lim_{n\to\infty} \sup_{s_0,\theta} P_e(s_0, \theta) = 0$, this implies $R \le C$.

IV. Achievability

Before proving achievability, we mention a simple case which follows from previous results. If the
set $\Theta$ has finite cardinality, then achievability follows immediately from the results in [14, Theorem 14],
which are true for any finite state channel with feedback. Hence, we can construct a finite state channel
where the augmented state is $(s, \theta)$, and by assuming that the initial distribution is positive for all $(s_0, \theta)$
we get that for any $\theta \in \Theta$, $|\Theta| < \infty$, and any $s_0 \in \mathcal{S}$ the rate $R$ is achievable if

\[ R < \lim_{n\to\infty} \max_{Q(x^n \| z^{n-1})} \min_{s_0,\theta} \frac{1}{n} I(X^n \to Y^n | s_0, \theta). \tag{17} \]

More work is needed in the achievability proof when the set $\Theta$ is not restricted to finite cardinality.
This is outlined in the following subsections in three steps. In the first step, we assume that the decoder
knows the channel $\theta$ in use and we show in Theorem 2 that if $R < C$ and if the decoder consists of
a maximum-likelihood decoder, then there exist codes for which the error probability decays uniformly
over the family $\Theta$ and exponentially in the blocklength. The codes used in showing this result are codes
of blocklength $Nm$ where each sub-block of length $m$ is generated i.i.d. according to some distribution.
In the second step, we show in Lemma 3 that if instead the codes are chosen uniformly and independently
from a set of possible blocklength-$Nm$ codes, then the error probability still decays uniformly over $\Theta$
and exponentially in the blocklength. In the third and final step, we show in Theorem 4 and Lemma 5
that for codes chosen uniformly and independently from a set of blocklength-$Nm$ codes, there exists
a decoder that for every channel $\theta \in \Theta$ achieves the same error exponent as the maximum-likelihood
decoder tuned to $\theta$.

In the sections that follow, $\mathcal{P}(\mathcal{X}^n \| \mathcal{Z}^{n-1})$ denotes the set of probability distributions on $X^n$ causally
conditioned on $Z^{n-1}$.

A. Achievability for a decoder tuned to $\theta$

We begin by proving that if the decoder is tuned to the channel $\theta \in \Theta$ in use, i.e., if the decoder
knows the channel $\theta$ in use, and if $R < C$ then the average error probability approaches zero. This is
proved through the use of random coding and maximum likelihood (ML) decoding.

The encoding scheme consists of randomly generating a code-tree for each message $w$, as shown in
Figure 2(b) for the case of binary feedback. A code-tree has depth $n$ corresponding to the blocklength and
level $i$ designates a set of $|\mathcal{Z}|^{i-1}$ possible codeword symbols. One of the $|\mathcal{Z}|^{i-1}$ symbols is chosen as the
input $x_i$ according to the feedback sequence $z^{i-1}$. The first codeword symbol is generated as $X_1 \sim Q(x_1)$.
The second codeword symbol is generated by conditioning on the previous codeword symbol and on the
feedback, $X_2 \sim Q(x_2|x_1, z_1)$ for all possible values of $z_1$. For instance, in the binary case, $|\mathcal{Z}| = 2$, two
possible values (branches) of $X_2$ will be generated and the transmitted codeword symbol will be selected
from among these two values according to the value of the feedback $z_1$. Subsequent codeword symbols
are generated similarly, $X_i \sim Q(x_i|x^{i-1}, z^{i-1})$ for all possible $z^{i-1}$. For a given feedback sequence $z^{n-1}$,
the input distribution, corresponding to the distribution on a path through the tree of depth $n$, is

\[ Q(x^n \| z^{n-1}) = \prod_{i=1}^{n} Q(x_i | x^{i-1}, z^{i-1}). \tag{18} \]
1=1 



[Figure 2: panels (a) codeword (no feedback), (b) code-tree, (c) concatenated code-tree; each panel is drawn against the time index $i$.]

Fig. 2. Illustration of coding scheme for (a) setting without feedback, (b) setting with binary feedback as used in [14] and 
(c) a code-tree that was created by concatenating smaller code-trees. In the case of no feedback each message is mapped to 
a codeword, and in the case of feedback each message is mapped to a code-tree. The third scheme is a code-tree of depth 4 
created by concatenating two trees of depth 2. 

A code-tree of depth $n$ is a vector of $D(n)$ symbols, where

\[ D(n) = \sum_{i=1}^{n} |\mathcal{Z}|^{i-1} = \frac{|\mathcal{Z}|^n - 1}{|\mathcal{Z}| - 1}, \tag{19} \]

and each element in the vector takes value from the alphabet $\mathcal{X}$. We denote a random code-tree by $A^{D(n)}$
and a realization of the random code-tree by $a^{D(n)}$. The probability of a tree $a^{D(n)} \in \mathcal{X}^{D(n)}$ is uniquely
determined by $Q_{X^n \| Z^{n-1}}(\cdot \| \cdot) \in \mathcal{P}(\mathcal{X}^n \| \mathcal{Z}^{n-1})$. For instance, consider the case of binary feedback,
$\mathcal{Z} = \{0, 1\}$, and a tree of depth $n = 2$, for which $D(n) = 3$. A code-tree is a vector $a^3 = (x_1, x_{21}, x_{22})$
where $x_1$ is the symbol sent at time $i = 1$, $x_{21}$ is the symbol sent at time $i = 2$ for feedback $z_1 = 0$,
and $x_{22}$ is the symbol sent at time $i = 2$ for feedback $z_1 = 1$. Then

\[ \Pr(A^3 = a^3) = Q(x_1) Q(x_{21}|x_1, z_1 = 0) Q(x_{22}|x_1, z_1 = 1), \tag{20} \]

which is uniquely determined by $Q(x^2 \| z^1)$. In general, for a code-tree of depth $n$, the following holds.

\[ \Pr(A^{D(n)} = a^{D(n)}) = \prod_{i=1}^{n} \prod_{z^{i-1} \in \mathcal{Z}^{i-1}} Q(x_i | x^{i-1}, z^{i-1}) \tag{21} \]

A code-tree for each message $w$ is randomly generated, and for each message $w$ and feedback sequence
$z^{n-1}$ the codeword $x^n(w, z^{n-1})$ is unique. The decoder is made aware of the code-trees for all messages.
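The following minimal sketch illustrates the code-tree as a flat vector of $D(n)$ symbols and shows how a feedback sequence selects one path, i.e. one codeword, through it. The level-by-level storage order (which matches the ordering $(x_1, x_{21}, x_{22})$ of the depth-2 example above) and all function names are assumptions made for this illustration.

```python
def tree_size(depth, feedback_alphabet_size):
    """D(n): the sum over i = 1..n of |Z|^(i-1)."""
    return sum(feedback_alphabet_size ** (i - 1) for i in range(1, depth + 1))


def node_index(z_prefix, feedback_alphabet_size):
    """Position in the flat code-tree vector of the node reached by the feedback prefix.

    Assumes the tree is stored level by level, and within a level in
    lexicographic order of the feedback prefix.
    """
    level = len(z_prefix)                                  # node is used at time level + 1
    offset = tree_size(level, feedback_alphabet_size)      # nodes stored at earlier levels
    within = 0
    for z in z_prefix:
        within = within * feedback_alphabet_size + z
    return offset + within


def codeword_for_feedback(tree, z_sequence, feedback_alphabet_size):
    """Read off the codeword x^n selected by the feedback sequence z^{n-1}."""
    return [tree[node_index(z_sequence[:i], feedback_alphabet_size)]
            for i in range(len(z_sequence) + 1)]


if __name__ == "__main__":
    # Depth-2 tree with binary feedback: (x_1, x_21, x_22).
    tree = ("x1", "x21", "x22")
    print(tree_size(2, 2))
    print(codeword_for_feedback(tree, (0,), 2))
    print(codeword_for_feedback(tree, (1,), 2))
```

Because each feedback sequence indexes exactly one node per level, the codeword for a given message and feedback sequence is unique, as stated above.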

Assuming that the ML decoder knows the channel $\theta$ in use, it estimates the message as follows.

\[ \hat{w} = \arg\max_{w} P(y^n | w, \theta) \tag{22} \]

As shown in [14], since $x^n$ is uniquely determined by $w$ and $z^{n-1}$ and since $z^{n-1}$ is a deterministic function
of $y^{n-1}$, we have the equivalence

\[ P(y^n | w, \theta) = P(y^n \| x^n(w, z^{n-1}), \theta) \tag{23} \]

so the ML decoder can be described as

\[ \hat{w} = \arg\max_{w} P(y^n \| x^n(w, z^{n-1}), \theta). \tag{24} \]

Let $P_e(s_0, \theta)$ denote the average (over messages) error probability incurred when a code of blocklength $n$
is used over channel $\theta$ with initial state $s_0$. The following theorem bounds the error probability uniformly
in $(s_0, \theta)$ when the decoder knows the channel $\theta \in \Theta$ in use. The theorem is proved in Appendix II.

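Theorem 2: For a compound FSC with initial state $s_0 \in \mathcal{S}$, input alphabet $\mathcal{X}$, and output alphabet $\mathcal{Y}$,
assuming that the decoder knows the channel $\theta$ in use, there exists a code of rate $R$ and blocklength
$Nm$, where $N \ge 1$ and $m$ is chosen such that $C_m > R + \epsilon$, for which the error probability $P_e(s_0, \theta)$
of the ML decoder satisfies

\[ P_e(s_0, \theta) \le |\mathcal{S}| \exp(-Nm\,\beta(\epsilon, m, |\mathcal{Y}|)) \tag{25} \]

for any $\theta \in \Theta$, where

\[ \beta(\epsilon, m, |\mathcal{Y}|) = \begin{cases} \dfrac{m \epsilon^2}{2 (\log(e|\mathcal{Y}|^m))^2}, & \epsilon \le \frac{1}{m} (\log(e|\mathcal{Y}|^m))^2 \\[2mm] \epsilon - \dfrac{1}{2m} (\log(e|\mathcal{Y}|^m))^2, & \text{otherwise.} \end{cases} \tag{26} \]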
The result in Theorem 2 is shown by the use of a randomly-generated code-tree of depth $Nm$ for each
message $w$. For every feedback sequence $z^{Nm-1}$, the corresponding path in the code-tree is generated
by the input distribution $Q(x^{Nm} \| z^{Nm-1}) \in \mathcal{P}(\mathcal{X}^{Nm} \| \mathcal{Z}^{Nm-1})$ given by

\[ Q(x^{Nm} \| z^{Nm-1}) = \prod_{j=1}^{N} Q^*_m\left(x_{(j-1)m+1}^{jm} \,\Big\|\, z_{(j-1)m+1}^{jm-1}\right), \tag{27} \]

where $Q^*_m$ is the distribution that achieves the supremum in $C_m$. The random codebook $\mathcal{C}$ used in
proving Theorem 2 consists of $e^{NmR}$ code-trees. Each code-tree in the codebook is a concatenated code-tree
with depth $Nm$ consisting of $N$ code-trees, each of depth $m$. For a given feedback sequence
$z^{Nm-1}$ (corresponding to a certain path in the concatenated code-tree) the codeword is generated by
$Q(x^{Nm} \| z^{Nm-1})$. An example of a concatenated code-tree is found in Figure 2(c).

B. Achievability for codewords chosen uniformly over a set 

In this subsection we show that the result in Theorem 2 implies that the error probability can be similarly
bounded when codewords are chosen uniformly over a set. In other words, we convert the random coding
exponent given in Theorem 2, where it is assumed that the codebook consists of concatenated code-trees
of depth $Nm$ in which each sub-tree of depth $m$ is generated i.i.d. according to $Q^*_m$, to a new random
coding exponent for which the concatenated code-trees in the codebook are chosen uniformly from a
set of concatenated code-trees. This alternate type of random coding, where the concatenated code-trees
are chosen uniformly from a set, is the coding approach subsequently used to prove the existence of a
universal decoder.

We first introduce the notion of types on code-trees. Let $a^{ND(m)}$ denote the concatenation of $N$ depth-$m$
code-trees $a^{D(m)}$, where $D(m)$ is defined in (19) and $a^{D(m)} \in \mathcal{X}^{D(m)}$. The type (or empirical
probability distribution) of a concatenated code-tree $a^{ND(m)}$ is the relative proportion of occurrences of
each code-tree $a^{D(m)} \in \mathcal{X}^{D(m)}$. Equivalently, $N$ multiplied by the type of $a^{ND(m)}$ indicates the number
of times each depth-$m$ code-tree from the set $\mathcal{X}^{D(m)}$ occurs in the concatenated code-tree $a^{ND(m)}$. Let
$\mathcal{P}_N(\mathcal{X}^{D(m)})$ denote the set of types of concatenated code-trees of depth $Nm$.
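As a small illustration of types on code-trees, the sketch below treats a concatenated code-tree as a sequence of $N$ depth-$m$ sub-trees (each a tuple of $D(m)$ symbols) and computes its empirical distribution. It also evaluates the quantity $\delta(N, m, |\mathcal{Z}|) = |\mathcal{X}|^{D(m)} \log(N+1)/(Nm)$ used in Lemma 3 below and the corresponding polynomial bound on the number of types. The representation and the function names are assumptions made for this example.

```python
import math
from collections import Counter
from fractions import Fraction


def tree_type(concatenated):
    """Type of a concatenated code-tree: relative frequency of each depth-m sub-tree."""
    n_blocks = len(concatenated)
    counts = Counter(concatenated)
    return {sub_tree: Fraction(c, n_blocks) for sub_tree, c in counts.items()}


def type_count_bound(n_blocks, num_distinct_subtrees):
    """Upper bound (N + 1)^K on the number of types when there are K possible sub-trees."""
    return (n_blocks + 1) ** num_distinct_subtrees


def delta(n_blocks, m, input_alphabet_size, feedback_alphabet_size):
    """delta(N, m, |Z|) = |X|^{D(m)} log(N + 1) / (N m)."""
    d_m = sum(feedback_alphabet_size ** (i - 1) for i in range(1, m + 1))
    return input_alphabet_size ** d_m * math.log(n_blocks + 1) / (n_blocks * m)


if __name__ == "__main__":
    # N = 4 sub-trees of depth m = 2 with binary input and binary feedback (D(m) = 3).
    concatenated = [(0, 1, 1), (0, 1, 1), (1, 0, 0), (0, 1, 1)]
    print(tree_type(concatenated))
    print(type_count_bound(4, 2 ** 3))
    print(delta(4, 2, 2, 2))
```

The bound on the number of types grows only polynomially in $N$, which is why $\delta(N, m, |\mathcal{Z}|)$ vanishes as $N$ grows for fixed $m$.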

Let $P_e(n, R, Q, P)$ denote the average probability of error incurred when a code-tree of depth $n$ and
rate $R$ drawn according to a distribution $Q \in \mathcal{P}(\mathcal{X}^n \| \mathcal{Z}^{n-1})$ is used over the channel $P$. We now prove
the following result.

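Lemma 3: Given $Q_m \in \mathcal{P}(\mathcal{X}^m \| \mathcal{Z}^{m-1})$, let $Q_{Nm} \in \mathcal{P}(\mathcal{X}^{Nm} \| \mathcal{Z}^{Nm-1})$ denote the distribution given
by the $N$-fold product of $Q_m$, i.e.,

\[ Q_{Nm}(a^{ND(m)}) = \prod_{i=1}^{N} Q_m(a_i^{D(m)}), \tag{28} \]

where $a_i^{D(m)}$ is the $i$th depth-$m$ code-tree in the concatenation. For a given type $\hat{Q}_{Nm} \in \mathcal{P}_N(\mathcal{X}^{D(m)})$,
let $Q^{U}_{Nm} \in \mathcal{P}(\mathcal{X}^{Nm} \| \mathcal{Z}^{Nm-1})$ denote the distribution that is
uniform over the set of concatenated code-trees of type $\hat{Q}_{Nm}$. For every distribution $Q_m \in \mathcal{P}(\mathcal{X}^m \| \mathcal{Z}^{m-1})$
there exists a type $\hat{Q}_{Nm} \in \mathcal{P}_N(\mathcal{X}^{D(m)})$ whose choice depends on $Q_m$ and $N$ but not on $P$ such that

\[ P_e(Nm, R, Q^{U}_{Nm}, P) \le \exp(2Nm\,\delta(N,m,|\mathcal{Z}|))\, P_e(Nm, R + m\delta(N,m,|\mathcal{Z}|), Q_{Nm}, P) \tag{29} \]

for all $P$, where $\delta(N,m,|\mathcal{Z}|) = |\mathcal{X}|^{D(m)} \log(N+1)/(Nm)$ tends to 0 as $N \to \infty$.

Proof: The proof follows the approach of [4, Lemma 3] except that our codebook consists of code-trees
rather than codewords; we include this proof for completeness in describing the notion of types on
code-trees. Given a codebook $\mathcal{C}$ of rate $R + m\delta(N,m,|\mathcal{Z}|)$ chosen according to $Q_{Nm}$, we can construct a
sub-code $\mathcal{C}'$ of rate $R$ in the following way. Let $Q'$ denote the type with the highest occurrence in $\mathcal{C}$. The
number of types in $\mathcal{C}$ is upper bounded by $(N+1)^{|\mathcal{X}|^{D(m)}} = \exp(Nm\,\delta(N,m,|\mathcal{Z}|))$, so the number of
concatenated code-trees of type $Q'$ is lower bounded by
$\exp(N(R + m\delta(N,m,|\mathcal{Z}|)))/\exp(Nm\,\delta(N,m,|\mathcal{Z}|)) = \exp(NR)$. We construct the code $\mathcal{C}'$ by picking
the first $\exp(NR)$ concatenated code-trees of type $Q'$. Since
$\mathcal{C}'$ is a sub-code of $\mathcal{C}$, its average probability of error is upper bounded by the average probability of
error of $\mathcal{C}$ times $|\mathcal{C}|/|\mathcal{C}'| = \exp(Nm\,\delta(N,m,|\mathcal{Z}|))$.

Conditioned on $Q'$, the codewords in $\mathcal{C}'$ are mutually independent and uniformly distributed over
a set of concatenated code-trees of type $Q'$. Since $\mathcal{C}$ is a random code, the type $Q'$ is also random,
and let $\pi$ denote the distribution of $Q'$. Pick a realization of the type $Q'$, denoted $\hat{Q}_{Nm}$, that satisfies
$\pi(\hat{Q}_{Nm}) \ge \exp(-Nm\,\delta(N,m,|\mathcal{Z}|))$. (This is possible since the number of types is upper bounded by
$\exp(Nm\,\delta(N,m,|\mathcal{Z}|))$.) Then

\begin{align}
\pi(\hat{Q}_{Nm}) P_e(Nm, R, Q^{U}_{Nm}, P) &\le \sum_{Q'} \pi(Q') P_e(Nm, R, Q', P) \tag{30} \\
&\le \exp(Nm\,\delta(N,m,|\mathcal{Z}|))\, P_e(Nm, R + m\delta(N,m,|\mathcal{Z}|), Q_{Nm}, P) \tag{31}
\end{align}

and

\begin{align}
P_e(Nm, R, Q^{U}_{Nm}, P) &\le \frac{\exp(Nm\,\delta(N,m,|\mathcal{Z}|))}{\pi(\hat{Q}_{Nm})}\, P_e(Nm, R + m\delta(N,m,|\mathcal{Z}|), Q_{Nm}, P) \tag{32} \\
&\le \exp(2Nm\,\delta(N,m,|\mathcal{Z}|))\, P_e(Nm, R + m\delta(N,m,|\mathcal{Z}|), Q_{Nm}, P). \tag{33}
\end{align}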



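Combining this result with Theorem 2, we have that there exists a type $\hat{Q}_{Nm} \in \mathcal{P}_N(\mathcal{X}^{D(m)})$ such that
when the codewords are chosen uniformly from the type class of $\hat{Q}_{Nm}$, given by the distribution $Q^{U}_{Nm}$,
the average probability of error is bounded as

\begin{align}
P_e(Nm, R, Q^{U}_{Nm}, P) &\le \exp(2Nm\,\delta(N,m,|\mathcal{Z}|))\, |\mathcal{S}| \exp\left(-Nm\,\beta\left(\epsilon - m\delta(N,m,|\mathcal{Z}|)/2, m, |\mathcal{Y}|\right)\right) \tag{34} \\
&= |\mathcal{S}| \exp\left\{-Nm\left[\beta\left(\epsilon - \tfrac{1}{2} m\delta(N,m,|\mathcal{Z}|), m, |\mathcal{Y}|\right) - 2\delta(N,m,|\mathcal{Z}|)\right]\right\}. \tag{35}
\end{align}

It is then possible to choose $N_0$ such that for all $N \ge N_0$,

\[ \frac{1}{2} m\,\delta(N,m,|\mathcal{Z}|) \le \frac{\epsilon}{2} \tag{36} \]

and

\[ 2\delta(N,m,|\mathcal{Z}|) \le \frac{1}{2} \beta\left(\frac{\epsilon}{2}, m, |\mathcal{Y}|\right), \tag{37} \]

which implies that the probability of error is bounded as

\[ P_e(Nm, R, Q^{U}_{Nm}, P) \le |\mathcal{S}| \exp\left(-Nm\,\frac{1}{2}\beta\left(\frac{\epsilon}{2}, m, |\mathcal{Y}|\right)\right). \tag{38} \]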

C. Existence of a universal decoder 

We next show that when a codebook is constructed by choosing code-trees uniformly from a set, there 
exists a universal decoder for the family of finite-state channels with feedback. This result is shown in 
the following four steps. 

• We define the notion of a strongly separable family $\Theta$ of channels given by the causal conditioning
distribution. The notion of strong separability means that the family is well-approximated by a finite
subset of the channels in $\Theta$.

• We prove that for strongly separable $\Theta$ and code-trees chosen uniformly from a set, there exists a
universal decoder.

• We describe the universal decoder which "merges" the ML decoders tuned to a finite subset of the
channels in $\Theta$.

• We show that the family of finite-state channels given by the causal conditioning distribution is a 
strongly separable family. 

Our approach follows precisely the approach of Feder and Lapidoth [9] except that our codebook consists 
of concatenated code-trees (rather than codewords) and our channel is given by the causal conditioning 
distribution. 

Let $a^{ND(m)}$ denote a concatenated code-tree of depth $Nm$, $a^{ND(m)} \in \mathcal{X}^{ND(m)}$ where $D(m) = (|\mathcal{Z}|^m - 1)/(|\mathcal{Z}| - 1)$,
and let $B_{Nm}$ denote a set of such code-trees, $B_{Nm} \subseteq \mathcal{X}^{ND(m)}$. As described in Lemma 3,
$B_{Nm}$ will be the set of code-trees of type $\hat{Q}_{Nm} \in \mathcal{P}_N(\mathcal{X}^{D(m)})$ and the code-tree for each message will be
chosen uniformly from this set, i.e. $Q^{U}_{Nm}(a^{ND(m)}) = 1/|B_{Nm}|$ for any $a^{ND(m)} \in B_{Nm}$. As described
below, for a given output sequence $y^{Nm}$, ML decoding will correspond to comparing the functions
$P_\theta(y^{Nm}|a^{ND(m)})$, $a^{ND(m)} \in B_{Nm}$. Note that comparing the functions $P_\theta(y^{Nm}|a^{ND(m)})$ is equivalent
to comparing the channel causal conditioning distributions since $P_\theta(y^{Nm}|a^{ND(m)}) = P_\theta(y^{Nm} \| x^{Nm})$ as






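shown below.

\begin{align}
P_\theta(y^{Nm}|a^{ND(m)}) &= \prod_{i=1}^{Nm} P_\theta(y_i | y^{i-1}, a^{ND(m)}) \tag{39} \\
&\stackrel{(a)}{=} \prod_{i=1}^{Nm} P_\theta(y_i | y^{i-1}, z^{i-1}, a^{ND(m)}) \tag{40} \\
&\stackrel{(b)}{=} \prod_{i=1}^{Nm} P_\theta(y_i | y^{i-1}, x^i) \tag{41} \\
&= P_\theta(y^{Nm} \| x^{Nm}) \tag{42}
\end{align}

In the above, (a) holds since $z^{i-1}$ is a known, deterministic function of $y^{i-1}$ and (b) holds since the
code-tree $a^{ND(m)}$ together with the feedback sequence $z^{i-1}$ uniquely determines the channel input $x^i$.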

For notational convenience, the results below on the universal decoder are stated for blocklength $n$,
where $A^{D(n)}$ denotes a code-tree of depth $n$ and $B_n$ denotes a set of such code-trees. These results extend
to the set of concatenated code-trees $B_{Nm}$ and any exceptions are described in the text. Furthermore,
we introduce the following notation: $\phi_\theta$ denotes the ML decoder tuned to channel $\theta$; $P_e(\theta, \phi)$ denotes
the average (over messages and codebooks chosen uniformly from a set) error probability when decoder
$\phi$ is used over channel $\theta$; and $P_e(\theta, \phi|\mathcal{C})$ denotes the average (over messages) error probability when
codebook $\mathcal{C}$ and decoder $\phi$ are used over channel $\theta$.

Definition 1: A family of channels $\{P_{Y^n \| X^n, \theta}(\cdot \| \cdot, \theta), \theta \in \Theta\}$ defined over common input and output
alphabets $\mathcal{X}, \mathcal{Y}$ is said to be strongly separable for the input code-tree sets $\{B_n\}$, $B_n \subseteq \mathcal{X}^{D(n)}$,
if there exists some $\mu > 0$ that upper bounds the error exponents in the family, i.e., that satisfies

\[ \limsup_{n\to\infty} \sup_{\theta \in \Theta} -\frac{1}{n} \log P_e(\theta, \phi_\theta) < \mu \tag{43} \]

such that for every $\epsilon > 0$ and blocklength $n$, there exists a subexponential number $K(n)$ (that may
depend on $\mu$ and on $\epsilon$) of channels $\{\theta_k^{(n)}\}_{k=1}^{K(n)} \subseteq \Theta$,

\[ \lim_{n\to\infty} \frac{1}{n} \log K(n) = 0, \tag{44} \]

that well approximate any $\theta \in \Theta$ in the following sense: For any $\theta \in \Theta$ there exists $\theta_{k^*}^{(n)} \in \Theta$, $1 \le k^* \le K(n)$,
so that

\[ P(y^n \| x^n, \theta) \le 2^{n\epsilon} P(y^n \| x^n, \theta_{k^*}^{(n)}), \quad \forall (x^n, y^n) : P(y^n \| x^n, \theta) > 2^{-n(\mu + \log |\mathcal{Y}|)} \tag{45} \]

and

\[ P(y^n \| x^n, \theta) \ge 2^{-n\epsilon} P(y^n \| x^n, \theta_{k^*}^{(n)}), \quad \forall (x^n, y^n) : P(y^n \| x^n, \theta_{k^*}^{(n)}) > 2^{-n(\mu + \log |\mathcal{Y}|)}. \tag{46} \]

The notion of strong separability means that the family $\Theta$ is well-approximated by a finite subset
$\{\theta_k^{(n)}\}_{k=1}^{K(n)} \subseteq \Theta$ of the channels in the family. In order to prove that the family of finite-state channels
with feedback is separable, we will need a value $\mu$ that satisfies (43). The error probability $P_e(\theta, \phi_\theta)$ is
lower bounded by the probability that the output sequence corresponding to two different messages
is the same for a given realization of the channel and code-tree. For a random code-tree this is lower
bounded by a uniform memoryless distribution on the channel output. Then $P_e(\theta, \phi_\theta) \ge |\mathcal{Y}|^{-n}$ and
a suitable candidate for $\mu$ is $1 + \log |\mathcal{Y}|$. The following theorem shows the existence of a universal
decoder for a strongly separable family and input code-tree sets $B_n$. The proof follows from the proof of
Theorem 2 in [9] except that we replace the channel conditional distribution $P(y^n|x^n, \theta)$ with the causal
conditioning distribution $P(y^n \| x^n, \theta)$.

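Theorem 4: If a family of channels defined over common finite input and output alphabets $\mathcal{X}, \mathcal{Y}$ is
strongly separable for the input code-tree sets $\{B_n\}$, then there exists a sequence of rate-$R$ blocklength-$n$
codes $\mathcal{C}_n$ and a sequence of decoders $\{u_n\}$ such that

\[ \lim_{n\to\infty} \sup_{\theta \in \Theta} \frac{1}{n} \log \left( \frac{P_e(\theta, u_n | \mathcal{C}_n)}{P_e(\theta, \phi_\theta)} \right) = 0. \tag{47} \]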

The universal decoder $u_n$ in Theorem 4 is given by "merging" the ML decoders tuned to channels
$\theta_k^{(n)}$, $1 \le k \le K(n)$, that are used to approximate the family $\Theta$. In order to describe the merging of the
ML decoders, we first present the ranking function $M_\theta$. An ML decoder tuned to the channel $\theta$ can be
described by a ranking function $M_\theta$ defined as the mapping

\[ M_\theta : B_{Nm} \times \mathcal{Y}^{Nm} \to \{1, 2, \ldots, |B_{Nm}|\} \tag{48} \]

where a rank of 1 denotes the code-tree $a^{ND(m)}$ that is most likely given output $y^{Nm}$, rank 2 denotes
the second most likely code-tree, and so on. For a given received sequence $y^{Nm}$, every code-tree in the
set $B_{Nm}$ is assigned a rank. For code-trees $a_1^{ND(m)}, a_2^{ND(m)} \in B_{Nm}$,

\[ P_\theta(y^{Nm}|a_1^{ND(m)}) > P_\theta(y^{Nm}|a_2^{ND(m)}) \implies M_\theta(a_1^{ND(m)}, y^{Nm}) < M_\theta(a_2^{ND(m)}, y^{Nm}). \tag{49} \]

By (42), comparing the function $P_\theta(y^{Nm}|a^{ND(m)})$ is equivalent to comparing the channel causal
conditioning distribution $P_\theta(y^{Nm} \| x^{Nm})$. Letting $\phi_\theta$ denote the ML decoder tuned to $\theta$, we can describe the
decoder as

\[ \phi_\theta(y^{Nm}) = w \ \text{iff} \ M_\theta(a^{ND(m)}(w), y^{Nm}) < M_\theta(a^{ND(m)}(w'), y^{Nm}), \ \forall w' \neq w \tag{50} \]

where $a^{ND(m)}(w)$ represents the code-tree chosen for message $w$, $1 \le w \le e^{NmR}$. In the case that
multiple code-trees maximize the likelihood $P_\theta(y^{Nm}|a^{ND(m)})$ for a given $y^{Nm}$, the ranking function
$M_\theta$ determines which code-tree (and correspondingly message) is chosen by the decoder. In the case
that the same code-tree from $B_{Nm}$ is chosen for more than one message, the ranks will be identical
and a decoding error will occur. Note that for a given output sequence $y^{Nm}$, the decoder $\phi_\theta(y^{Nm})$ will
not always return the code-tree $a^{ND(m)} \in B_{Nm}$ for which $M_\theta(a^{ND(m)}, y^{Nm}) = 1$, since the code-tree
$a^{ND(m)}$ may not be in the codebook.

Now consider a set of $K$ channels from the family $\Theta$, given by $\theta_k \in \Theta$, $1 \le k \le K$. The codebooks for these $K$ channels will be drawn randomly from the set $B_{Nm}$. (Note that the same set $B_{Nm}$ is used for all channels $\theta_k$ since, as shown in Lemma 3, the type $Q_{Nm}$ is chosen independent of the channel $P$.) The $K$ ML decoders matched to these channels, denoted $\phi_{\theta_1}, \phi_{\theta_2}, \ldots, \phi_{\theta_K}$, can be merged as shown in [9]. The merged decoder $u_K$ is described by its ranking function $M_{u_K}$, which is a mapping

$$M_{u_K} : B_{Nm} \times \mathcal{Y}^{Nm} \to \{1, 2, \ldots, |B_{Nm}|\}$$ (51)

that ranks all of the code-trees in $B_{Nm}$ for each output sequence $y^{Nm}$. The ranking $M_{u_K}$ is established for a given $y^{Nm}$ by assigning rank 1 to the code-tree for which $M_{\theta_1} = 1$, rank 2 to the code-tree for which $M_{\theta_2} = 1$, rank 3 to the code-tree for which $M_{\theta_3} = 1$, and so on. After considering the code-trees with rank 1 for all $M_{\theta_k}$, the code-trees with rank 2 in $M_{\theta_k}$, $1 \le k \le K$, are considered in order and added into the ranking $M_{u_K}$. The process continues until the code-trees with rank $|B_{Nm}|$ for all $M_{\theta_k}$ have been assigned a rank in $M_{u_K}$. Throughout this process, if a code-tree has already been ranked, it is simply skipped over, and its original (higher) ranking is maintained. The rank of a code-tree in $M_{u_K}$ can be upper bounded according to its rank in $M_{\theta_k}$ as shown in [9] and stated as follows.

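$$M_{\theta_k}(a^{ND(m)}, y^{Nm}) = j \;\Rightarrow\; M_{u_K}(a^{ND(m)}, y^{Nm}) \le (j-1)K + k, \quad \forall a^{ND(m)} \in B_{Nm},\ \forall k,\ 1 \le k \le K$$ (52)

This bound on the rank in $M_{u_K}$ implies another (looser) upper bound.

$$M_{u_K}(a^{ND(m)}, y^{Nm}) \le K\, M_{\theta_k}(a^{ND(m)}, y^{Nm}), \quad \forall (a^{ND(m)}, y^{Nm}) \in B_{Nm} \times \mathcal{Y}^{Nm},\ \forall k,\ 1 \le k \le K$$ (53)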
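The merging procedure and the bound (52) can be illustrated with a minimal sketch. It assumes each individual ranking is given as a list of the same items ordered from rank 1 downwards; the function names `merge_rankings` and `check_rank_bound` are hypothetical and are not taken from [9].

```python
from typing import Hashable, List, Sequence


def merge_rankings(rankings: Sequence[Sequence[Hashable]]) -> List[Hashable]:
    """Round-robin merge of K rankings of the same finite set.

    rankings[k][j] is the item with rank j+1 under decoder k+1.
    The returned list is ordered by rank under the merged decoder.
    """
    merged, seen = [], set()
    size = len(rankings[0])
    for j in range(size):           # rank position j+1 in each individual ranking
        for order in rankings:      # decoders k = 1, ..., K, visited in order
            item = order[j]
            if item not in seen:    # already ranked: keep its earlier (better) rank
                seen.add(item)
                merged.append(item)
    return merged


def check_rank_bound(rankings, merged) -> bool:
    """Check that merged rank <= (j - 1) * K + k whenever decoder k gives rank j."""
    K = len(rankings)
    merged_rank = {item: r for r, item in enumerate(merged, start=1)}
    return all(
        merged_rank[item] <= (j - 1) * K + k
        for k, order in enumerate(rankings, start=1)
        for j, item in enumerate(order, start=1)
    )


# Toy example with K = 3 rankings of four items (labels are illustrative only).
example = [["a", "b", "c", "d"], ["d", "c", "b", "a"], ["b", "a", "d", "c"]]
merged_example = merge_rankings(example)
```

Since $(j-1)K + k \le jK$, any ranking that passes `check_rank_bound` also satisfies the looser bound (53).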

Equation (53) can be used to upper bound the error probability when sequences output from the channel $\theta \in \Theta$ are decoded by the merged decoder $u_K$. This is a key element of the proof of Theorem 4. Finally, we state the lemma below, which shows that the family of finite-state channels defined by the causal conditioning distribution is strongly separable. Together with Theorem 4, this establishes the existence of a universal decoder for the problem we consider, and completes our proof of achievability.

Lemma 5: The family of all causal-conditioning finite-state channels defined over common finite input, output, and state alphabets $\mathcal{X}, \mathcal{Y}, \mathcal{S}$ is strongly separable in the sense of Definition 1 for any input code-tree sets $\{B_n\}$.

Proof: See Appendix III. ■

V. Compound Gilbert-Elliot channel 

The Gilbert-Elliot channel is a widely used example of a finite-state channel. It has a state space consisting of 'good' and 'bad' states, $\mathcal{S} = \{G, B\}$, and in either of these two states the channel is a binary symmetric channel (BSC). The Gilbert-Elliot channel is a stationary and ergodic Markovian channel, i.e., $P(y_i, s_i | x_i, s_{i-1}, \theta) = P(s_i | s_{i-1}, \theta) P(y_i | x_i, s_{i-1}, \theta)$ is satisfied and the Markov process described by $P(s_i | s_{i-1}, \theta)$ is a stationary and ergodic process. For a given channel $\theta$, the BSC crossover probability is given by $P_B(\theta)$ for $S_i = B$ and $P_G(\theta)$ for $S_i = G$. The channel state $S_i$ forms a stationary Markov process with transition probabilities

$$g(\theta) = P(S_i = G | S_{i-1} = B) = 1 - P(S_i = B | S_{i-1} = B)$$ (54)

$$b(\theta) = P(S_i = B | S_{i-1} = G) = 1 - P(S_i = G | S_{i-1} = G)$$ (55)

For a given $\theta$, the Gilbert-Elliot channel is equivalent to the following additive noise channel

$$Y_i = X_i \oplus V_i$$ (56)

where $\oplus$ denotes modulo-2 addition and $V_i \in \{0, 1\}$. Conditioned on the state process $\{S_i\}$, the noise $V_i$ forms a Bernoulli process given by

$$P(V_i = 1 | S_i) = \begin{cases} P_B(\theta), & S_i = B \\ P_G(\theta), & S_i = G. \end{cases}$$ (57)

For a given channel 9, the capacity of the Gilbert-Elliot channel is found in [8] and is achieved by a 
uniform Bernoulli input distribution. 
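The additive-noise description (54)-(57) can be simulated with a short sketch. It assumes that the state is updated first and that the noise at time $i$ is then drawn with the crossover probability of the new state; the function name and its parameters are illustrative and do not come from the paper.

```python
import random


def simulate_gilbert_elliot(x, g, b, p_good, p_bad, seed=0, start_state="G"):
    """Pass the binary sequence x through a Gilbert-Elliot channel.

    g = P(S_i = G | S_{i-1} = B), b = P(S_i = B | S_{i-1} = G).
    Returns the output bits and the sequence of visited states.
    """
    rng = random.Random(seed)
    state, outputs, states = start_state, [], []
    for bit in x:
        # Two-state Markov transition, as in (54)-(55).
        if state == "B":
            state = "G" if rng.random() < g else "B"
        else:
            state = "B" if rng.random() < b else "G"
        # Bernoulli noise whose parameter depends on the current state, as in (57).
        crossover = p_bad if state == "B" else p_good
        noise = 1 if rng.random() < crossover else 0
        outputs.append(bit ^ noise)  # Y_i = X_i xor V_i, as in (56)
        states.append(state)
    return outputs, states


# Example: a mostly good channel with occasional noisy bursts (values are arbitrary).
y, s = simulate_gilbert_elliot([0, 1, 1, 0, 1, 0, 0, 1], g=0.3, b=0.1,
                               p_good=0.01, p_bad=0.4, seed=1)
```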

The following example illustrates that the feedback capacity of a channel with memory is in general not given by

$$C_{FB} = \inf_\theta C_\theta,$$ (58)

as in the memoryless case.

Example 1: [4] Consider the example of a Gilbert-Elliot channel where $P_G(\theta) = 0$, $P_B(\theta) = 0.5$, $b(\theta) = g(\theta) = 2^{-\theta}$ for $\theta = 1, 2, 3, \ldots$, with feedback. The compound feedback capacity of this channel is zero because, assuming that we start in the bad state, for any blocklength $n$ the channel that corresponds to $\theta = n$ will remain in the bad state for the duration of the transmission with probability $(1 - 2^{-n})^n \ge 1 - n 2^{-n} \ge \frac{1}{2}$. While the channel is in the bad state the probability of error for decoding the message is positive with or without feedback, hence no reliable communication is possible.

However, if we fix $\theta$, then the capacity $C_\theta$ is at least $1 - h_b(\frac{1}{4})$, because we can use a deep enough interleaver to make the channel look like a memoryless BSC with crossover probability $\frac{1}{4}$.
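The two numerical claims in Example 1 can be checked with a few lines. This is only a sanity check under the stated parameters; it assumes the binary entropy $h_b$ is measured in bits and that, because $b(\theta) = g(\theta)$, the stationary state distribution is uniform, so the interleaved crossover probability is $\frac{1}{2} \cdot 0 + \frac{1}{2} \cdot 0.5 = \frac{1}{4}$.

```python
import math


def binary_entropy(p):
    """Binary entropy h_b(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def prob_stay_bad(n):
    """Probability that the channel theta = n stays in the bad state for n uses."""
    return (1 - 2.0 ** (-n)) ** n


def bernoulli_lower_bound(n):
    """The bound 1 - n * 2^(-n) used in the example."""
    return 1 - n * 2.0 ** (-n)


# Stationary mixture of the good (crossover 0) and bad (crossover 0.5) states.
interleaved_crossover = 0.5 * 0.0 + 0.5 * 0.5
capacity_lower_bound = 1 - binary_entropy(interleaved_crossover)

for n in (1, 2, 5, 10, 20):
    print(n, prob_stay_bad(n), bernoulli_lower_bound(n))
print("1 - h_b(1/4) =", capacity_lower_bound)
```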

A Gilbert-Elliot channel is described by the four parameters $g(\theta)$, $b(\theta)$, $P_G(\theta)$, and $P_B(\theta)$ that lie between 0 and 1, and for any fixed $n$, $P(y^n \| x^n, s_0)$ is continuous in those parameters. The continuity of $P(y^n \| x^n, s_0)$ follows from the fact that $P(y_i, s_i | x_i, s_{i-1})$ is continuous in the four parameters for any $i \ge 1$, and also because (as shown in Appendix III in Eqns. (111) and (113)) we can express $P(y^n \| x^n, s_0)$ as

$$P(y^n \| x^n, s_0) = \sum_{s^n} \prod_{i=1}^{n} P(y_i, s_i | x_i, s_{i-1}).$$ (59)

Let us denote by $\bar{\Theta}$ the closure of the family of channels. Hence instead of $\inf_{\theta \in \Theta}$ we can write $\min_{\theta \in \bar{\Theta}}$ since $\bar{\Theta}$ is compact and since $\mathcal{I}(Q; P)$ is continuous in $P$. Now, let $Q_u(x^n)$ denote the uniform distribution over $\mathcal{X}^n$. We have

$$\begin{aligned} \max_{Q} \min_{s_0, \theta} \mathcal{I}(Q; P) &\overset{(a)}{\le} \min_{s_0, \theta} \max_{Q} \mathcal{I}(Q; P) \\ &\overset{(b)}{=} \min_{s_0, \theta} \mathcal{I}(Q_u; P) \end{aligned}$$ (60)

where (a) follows from the fact that $\max \min \le \min \max$ and (b) follows from the fact that, for every channel in the family, the uniform input distribution achieves the maximum. Therefore we can restrict the maximization to the uniform distribution $Q_u$ instead of a general causal conditioning distribution $Q(x^n \| z^{n-1})$. Hence feedback does not increase the capacity of the compound Gilbert-Elliot channel. This result holds for any family of FSCs for which the uniform input distribution achieves the capacity of each channel in the family and is closely related to Alajaji's result [17] that feedback does not increase the capacity of discrete additive noise channels.
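The max-min versus min-max argument can be illustrated numerically in a much simpler setting than the one treated above: a hypothetical finite family of memoryless BSCs without feedback, where the uniform input is optimal for every member. The crossover values and the input grid below are assumptions chosen only for illustration.

```python
import math


def h_b(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_mutual_information(q, p):
    """I(X;Y) in bits for a BSC with crossover p and input P(X = 1) = q."""
    p_y1 = q * (1 - p) + (1 - q) * p
    return h_b(p_y1) - h_b(p)


crossovers = [0.05, 0.1, 0.2]               # hypothetical family of BSCs
inputs = [i / 100 for i in range(101)]      # grid over the input probability q

max_min = max(min(bsc_mutual_information(q, p) for p in crossovers) for q in inputs)
min_max = min(max(bsc_mutual_information(q, p) for q in inputs) for p in crossovers)
uniform_value = min(bsc_mutual_information(0.5, p) for p in crossovers)
```

In this toy family all three quantities coincide, which mirrors the role of the uniform distribution $Q_u$ in (60).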

VI. Feedback capacity is positive if and only if capacity without feedback is positive 

In this section we show that the capacity of a compound channel that consists of stationary and uniformly ergodic Markovian channels is positive if and only if it is positive for the case that feedback is allowed. The intuition of this result comes mainly from Lemma 9, which states that

$$\max_{Q_{X^n}} I(X^n \to Y^n) = 0 \;\Rightarrow\; \max_{Q_{X^n \| Y^{n-1}}} I(X^n \to Y^n) = 0.$$ (61)

The reason our proof is restricted to the family of channels that are stationary and uniformly ergodic Markovian is that for this family of channels we can show that the capacity is zero only if for every finite $n$,

$$\max_{Q_{X^n \| Y^{n-1}}} \inf_\theta I(X^n \to Y^n | \theta) = 0.$$ (62)

A stationary and ergodic Markovian channel is a FSC where the state of the channel is a stationary and ergodic Markov process that is not influenced by the channel input and output. In other words, the conditional probability of the channel output and state given the input and previous state is given by

$$P(y_i, s_i | x_i, s_{i-1}, \theta) = P(s_i | s_{i-1}, \theta) P(y_i | x_i, s_{i-1}, \theta)$$ (63)

where the Markov process, described by the transition probability $P(s_i | s_{i-1}, \theta)$, is stationary and ergodic. We say that the family of channels is uniformly ergodic if all channels in the family are ergodic and for all $\epsilon > 0$ there exists an $M(\epsilon)$ such that for all $n > M(\epsilon)$

$$|\Pr(S_n = s | s_0, \theta) - P(s | \theta)| \le \epsilon, \quad \forall s_0 \in \mathcal{S},\ s \in \mathcal{S},\ \theta \in \Theta$$ (64)

where $P(s | \theta)$ is the stationary (equilibrium) distribution of the state for channel $\theta$. We define the sequence $C_n^{\text{Markovian}}$, $n \ge 1$, as

$$C_n^{\text{Markovian}} \triangleq \max_{Q_{X^n \| Y^{n-1}}} \inf_\theta \frac{1}{n} I(X^n \to Y^n | \theta).$$ (65)

Theorem 6: The channel capacity of a family of stationary and uniformly ergodic Markovian channels is positive if and only if the feedback capacity of the same family is positive.

Since a memoryless channel is a FSC with only one state, the theorem implies that the feedback capacity of a memoryless compound channel is positive if and only if it is positive without feedback. The theorem also implies that for a stationary and ergodic point-to-point channel (not compound), feedback does not increase the capacity when the capacity without feedback is zero. The stationarity of the channels in Theorem 6 is not necessary since, according to our achievability definition, if a rate is less than the capacity, it is achievable regardless of the initial state. We assume stationarity here in order to simplify the proofs. The uniform ergodicity is essential to the proof that is provided here, but there are also other families of channels that have this property. For instance, for the regular point-to-point Gaussian channel this result can be concluded from the factor-two result, which states that feedback at most doubles the capacity (cf. [18]-[20]). The proof of Theorem 6 is based on the following lemmas. We refer the reader to Appendix IV for the proofs of these lemmas.

Lemma 7: For any channel with feedback, if the input to the channel is distributed according to

$$Q(x^n \| z^{n-1}) = Q(x_1^k \| z_1^{k-1})\, Q(x_{k+1}^n \| z_{k+1}^{n-1}),$$

then

$$I(X^n \to Y^n) \ge I(X_1^k \to Y_1^k) + I(X_{k+1}^n \to Y_{k+1}^n).$$ (66)

Lemma 8: The feedback capacity of a family of stationary and uniformly ergodic Markovian channels is

$$\lim_{n \to \infty} C_n^{\text{Markovian}}.$$ (67)

The limit of $C_n^{\text{Markovian}}$ exists and is equal to $\sup_n C_n^{\text{Markovian}}$.

Lemma 9: Let the input distribution to an arbitrary channel be uniform over the input, i.e., $Q(x^n) = \frac{1}{|\mathcal{X}|^n}$. If under this input distribution $I(X^n \to Y^n) = 0$, then the channel has the property that $P(y^n \| x^n) = P(y^n)$ for all $x^n \in \mathcal{X}^n$, $y^n \in \mathcal{Y}^n$, and this implies that

$$\max_{Q_{X^n \| Y^{n-1}}} I(X^n \to Y^n) = 0.$$ (68)

Proof of Theorem 6: Let $C_{NFB}$ denote the capacity without feedback and $C_{FB}$ denote the capacity with feedback. The implication $C_{NFB} > 0 \Rightarrow C_{FB} > 0$ is trivial. To show that $C_{NFB} = 0 \Rightarrow C_{FB} = 0$, we use Lemma 8 to conclude that since $C_{NFB} = 0$ then $\sup_n C_n^{\text{Markovian}} = 0$ and therefore for any $n \ge 1$,

$$\max_{Q_{X^n}} \inf_\theta I(X^n \to Y^n | \theta) = 0.$$ (69)

In order to conclude the proof, we show that if (69) holds, then it also holds when we replace $Q_{X^n}$ by $Q_{X^n \| Y^{n-1}}$. Since $I(X^n \to Y^n)$ is continuous in $P_{Y^n \| X^n}$ and since the set $\Theta$ is a subset of the unit simplex, which is bounded, the infimum over the set $\Theta$ can be replaced by the minimum over the closure of the set $\Theta$. Since (69) holds also for the case that $Q_{X^n}$ is restricted to be the uniform distribution, Lemma 9 implies that the channel that satisfies $P(y^n \| x^n) = P(y^n)$ for all $x^n \in \mathcal{X}^n$, $y^n \in \mathcal{Y}^n$ is in the closure of $\Theta$ and therefore

$$\max_{Q_{X^n \| Y^{n-1}}} \inf_\theta I(X^n \to Y^n | \theta) = 0.$$ (70)

VII. Feedback capacity of the memoryless compound channel

Recall that the capacity of the memoryless compound channel (without feedback) is [1], [2]

$$\max_{Q_X} \inf_\theta \mathcal{I}(Q_X; P_{Y|X,\theta}).$$ (71)

Wolfowitz also showed [3] that when $\theta$ is known to the encoder, the capacity of the memoryless compound channel is given by switching the inf and the max, i.e.,

$$\inf_\theta \max_{Q_X} \mathcal{I}(Q_X; P_{Y|X,\theta}).$$ (72)

In this section we make use of Theorem 1 to show that (72) is equal to the feedback capacity of the memoryless compound channel.

A. Finite family of memoryless channels 

Based on Wolfowitz's result it is straightforward to show that if the family of memoryless channels is finite, $|\Theta| < \infty$, then the feedback capacity of the compound channel is given by switching the max and the min,

$$\min_\theta \max_{Q_X} \mathcal{I}(Q_X; P_{Y|X,\theta}).$$ (73)

This result can be achieved in two steps. Given a probability of error $P_e > 0$, first, the encoder will use $M$ uses of the channel in order to estimate the channel with probability of error less than $\frac{P_e}{2}$. Since the number of channels is finite such an $M$ exists. In the second step the encoder will use a coding scheme with blocklength $N$ adapted for the estimated channel to obtain an error probability that is smaller than $\frac{P_e}{2}$. Hence we get that the total error of the code of length $M + N$ is smaller than $P_e$.
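The first (training) step of the two-step scheme can be sketched for a toy case. The sketch assumes a hypothetical finite family of BSCs indexed by crossover probability; the encoder sends zeros, observes the outputs through the feedback link, and picks the family member closest to the empirical flip rate. The function name and parameter values are illustrative only.

```python
import random


def identify_bsc(family, true_index, num_training, seed=0):
    """Training phase: estimate which BSC in a finite family is in use.

    family is a list of crossover probabilities; the channel family[true_index]
    is simulated, and the index of the closest family member is returned.
    """
    rng = random.Random(seed)
    p_true = family[true_index]
    # With X = 0 sent, each output equals 1 exactly when the channel flips the bit.
    flips = sum(1 for _ in range(num_training) if rng.random() < p_true)
    estimate = flips / num_training
    # Choose the family member whose crossover is closest to the empirical rate.
    return min(range(len(family)), key=lambda k: abs(family[k] - estimate))


family = [0.05, 0.2, 0.4]          # hypothetical finite family of BSCs
estimated = [identify_bsc(family, k, num_training=2000, seed=k) for k in range(3)]
```

After this phase the encoder would switch to a code designed for the estimated channel, which is the second step described above.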

B. Arbitrary family of memoryless channels 

For the case that the number of channels is infinite, the argument above does not hold, since there is no guarantee that for any $P_e > 0$ there exists a blocklength $n(P_e)$ such that a $(e^{nR}, n)$ code achieves an error less than $P_e$ for all channels in the family.^1 However, we are able to establish the feedback capacity using our capacity theorem for the compound FSC, and the result is stated in the following theorem.

^1 In a private communication with A. Tchamkerten [21], it was suggested that the feedback capacity of the memoryless compound channel with an infinite family can also be established using the results in [9] (which show that the family of all discrete memoryless channels is strongly separable). The family is finitely quantized, a training scheme is used to estimate the appropriate quantization cell, the coding is performed according to the representative channel of that cell, and the decoding is done universally as in [9].

Theorem 10: The feedback capacity of the memoryless compound channel is

$$\inf_\theta \max_{Q_X} \mathcal{I}(Q_X; P_{Y|X,\theta}).$$ (74)

Theorem 10 is a direct result of Theorem 1 and the following lemma.
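Lemma 11: For a family of memoryless channels we have

$$\lim_{n \to \infty} \frac{1}{n} \max_{Q_{X^n \| Y^{n-1}}} \inf_\theta \mathcal{I}(Q_{X^n \| Y^{n-1}}; P_{Y^n \| X^n, \theta}) = \inf_\theta \max_{Q_X} \mathcal{I}(Q_X; P_{Y|X,\theta}).$$ (75)

The proof of Lemma 11 requires two lemmas, which we state below. The proofs of Lemmas 12 and 13 are found in Appendix V.

Lemma 12: Let $Q_X^1 = \arg\max_{Q_X} \mathcal{I}(Q_X, P_{Y|X,\theta_1})$ and $Q_X^2 = \arg\max_{Q_X} \mathcal{I}(Q_X, P_{Y|X,\theta_2})$. For two conditional distributions $P_{Y|X,\theta_1}$ and $P_{Y|X,\theta_2}$ with

$$\Delta = \|P_{Y|X,\theta_1} - P_{Y|X,\theta_2}\|_1 = \sum_{x, y} |P_{Y|X,\theta_1}(y|x) - P_{Y|X,\theta_2}(y|x)|$$ (76)

there exists an upper bound

$$|\mathcal{I}(Q_X^1, P_{Y|X,\theta_1}) - \mathcal{I}(Q_X^2, P_{Y|X,\theta_2})| \le \eta(\Delta)$$ (77)

where $\eta(\Delta) \to 0$ as $\Delta \to 0$.

Lemma 13: For any $\delta > 0$, any $\epsilon > 0$ and any channel $P_{Y|X}$, there exists an $M$ such that we can choose a channel $P_{Y|X,\hat{\theta}}$ as a function of $M$ inputs and outputs such that

$$\Pr\{\Delta > \epsilon\} < \delta,$$ (78)

where $\Delta$ denotes the $L_1$ distance between the estimated channel $P_{Y|X,\hat{\theta}}$ and the actual channel $P_{Y|X}$, i.e.,

$$\Delta = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} |P_{Y|X,\hat{\theta}}(y|x) - P_{Y|X}(y|x)|.$$ (79)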

Proof of Lemma 11: We prove the equality by showing that the following two inequalities hold:

$$\frac{1}{n} \max_{Q_{X^n \| Y^{n-1}}} \inf_\theta \mathcal{I}(Q_{X^n \| Y^{n-1}}; P_{Y^n \| X^n, \theta}) \le \inf_\theta \max_{Q_X} \mathcal{I}(Q_X; P_{Y|X,\theta}),$$ (80)

$$\frac{1}{n} \max_{Q_{X^n \| Y^{n-1}}} \inf_\theta \mathcal{I}(Q_{X^n \| Y^{n-1}}; P_{Y^n \| X^n, \theta}) \ge \inf_\theta \max_{Q_X} \mathcal{I}(Q_X; P_{Y|X,\theta}) - \epsilon_n,$$ (81)

where $\epsilon_n \to 0$ as $n \to \infty$. Inequality (80) is proved by the fact that max inf is less than or equal to inf max and by the fact that for a memoryless channel an i.i.d. input maximizes the directed information.

$$\begin{aligned} \frac{1}{n} \max_{Q_{X^n \| Y^{n-1}}} \inf_\theta \mathcal{I}(Q_{X^n \| Y^{n-1}}; P_{Y^n \| X^n, \theta}) &\le \frac{1}{n} \inf_\theta \max_{Q_{X^n \| Y^{n-1}}} \mathcal{I}(Q_{X^n \| Y^{n-1}}; P_{Y^n \| X^n, \theta}) \\ &= \inf_\theta \max_{Q_X} \mathcal{I}(Q_X; P_{Y|X,\theta}) \end{aligned}$$ (82)

In order to prove inequality (81) we consider the following input distribution. The first $M$ inputs are used to estimate the channel and we denote the estimated channel as $\hat{\theta}$. After the first $M$ inputs, the input distribution is the i.i.d. distribution that maximizes the mutual information between the input and the output for the channel $\hat{\theta}$. According to Lemma 13 we can estimate the channel to within an $L_1$ distance smaller than $\epsilon > 0$ with probability greater than $1 - \delta$, where $\delta > 0$. According to Lemma 12, by adjusting the input distribution to a channel that is at $L_1$ distance less than $\epsilon$ from the actual channel in use, we lose an amount that goes to zero as $\epsilon \to 0$. Under the input distribution described above we have the following sequence of inequalities.

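$$\begin{aligned} \frac{1}{n} \max_{Q_{X^n \| Y^{n-1}}} \inf_\theta \mathcal{I}(Q_{X^n \| Y^{n-1}}; P_{Y^n \| X^n, \theta}) &\overset{(a)}{=} \frac{1}{n} \max_{Q_{X^n \| Y^{n-1}}} \inf_\theta I(X^n \to Y^n | \theta) \\ &\overset{(b)}{\ge} \frac{1}{n} \max_{Q_{X^n \| Y^{n-1}}} \inf_\theta \sum_{i = M(\epsilon,\delta)+1}^{n} I(X^i; Y_i | Y^{i-1}) \\ &\overset{(c)}{\ge} \frac{1}{n} \max_{Q_{X^n \| Y^{n-1}}} \inf_\theta \sum_{i = M+1}^{n} I(X_{M+1}^i; Y_i | Y^{i-1}, X^M) \\ &\overset{(d)}{=} \frac{1}{n} \max_{Q_{X^n \| Y^{n-1}}} \inf_\theta \sum_{i = M+1}^{n} I(X_{M+1}^i; Y_i | Y_{M+1}^{i-1}, X^M, Y^M, \hat{\theta}(X^M, Y^M)) \\ &\overset{(e)}{\ge} \frac{1}{n} \max_{Q_{X|\hat{\theta}}} \inf_\theta (n - M)\, I(X; Y | \theta, \hat{\theta}) \\ &\overset{(f)}{=} \frac{1}{n} \max_{Q_{X|\hat{\theta}}} \inf_\theta (n - M) \sum_{\hat{\theta}} P(\hat{\theta})\, \mathcal{I}(Q_{X|\hat{\theta}}; P_{Y|X,\theta}) \\ &\overset{(g)}{\ge} \frac{1}{n} \max_{Q_{X|\theta}} \inf_\theta (n - M)(1 - \delta)\big(\mathcal{I}(Q_{X|\theta}; P_{Y|X,\theta}) - \eta(\epsilon)\big) \\ &\overset{(h)}{=} \frac{1}{n} \inf_\theta \max_{Q_X} (n - M)(1 - \delta)\big(\mathcal{I}(Q_X; P_{Y|X,\theta}) - \eta(\epsilon)\big) \end{aligned}$$ (83)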

(a) and (f) follow from a change of notation.

(b) follows from the fact that we sum fewer elements. The parameter $M$ is a function of $\epsilon > 0$ and $\delta > 0$ and is determined according to Lemma 13. For brevity of notation we denote $M(\epsilon, \delta)$ simply as $M$.

(c) follows from the fact that $H(Y_i | Y^{i-1}) \ge H(Y_i | Y^{i-1}, X^M)$.

(d) follows from the fact that the estimated channel is a random variable denoted as $\hat{\theta}$ and it is a deterministic function of $X^M, Y^M$ as described in Lemma 13.

(e) follows by restricting the input distribution $Q_{X^n \| Y^{n-1}}$ to one that uses the first $M$ uses of the channel to estimate $\theta$ as described in Lemma 13 and then uses an i.i.d. distribution, i.e., for $i > M$, $Q(x_i | x^{i-1}, y^{i-1}) = Q(x_i | x^{i-1}, y^{i-1}, \hat{\theta}(x^M, y^M)) = Q(x_i | \hat{\theta})$.

(g) follows from the fact that with probability $1 - \delta$ we have that the $L_1$ distance $\|P_{Y|X,\theta} - P_{Y|X,\hat{\theta}}\|_1 \le \epsilon$ and by applying Lemma 12, which states that for this case we lose $\eta(\epsilon)$ where $\eta(\epsilon) \to 0$ as $\epsilon \to 0$.

(h) follows from the fact that $\inf_\theta \max_{Q_X}$ is identical to $\max_{Q_{X|\theta}} \inf_\theta$.

Finally, since $M$ is fixed for any $\epsilon > 0$, $\delta > 0$, we can achieve any value below $\inf_\theta \max_{Q_X} \mathcal{I}(Q_X; P_{Y|X,\theta})$ for large $n$. Therefore inequality (81) holds. ■

VIII. Conclusion 

The compound channel is a simple model for communication under channel uncertainty. The original 
work on the memoryless compound channel without feedback characterizes the capacity [1], [2], which
is no greater than the capacity of any channel in the family, but the reliability function remains unknown.
An adaptive approach to using feedback on an unknown memoryless channel is proposed in [16], where
coding schemes that universally achieve the reliability function (the Burnashev error exponent) for certain
families of channels (e.g., for a family of binary symmetric channels) are provided. By using the variable- 
length coding approach in [16], the capacity of the channel in use can be achieved. In our work, we 
consider the use of fixed length block codes and aim to ensure reliability for every channel in the family; 
as a result, our capacity is limited by the infimum of the capacities of the channels in the family. For the 
compound channel with memory that we consider, we have characterized an achievable random coding 
exponent, but the reliability function remains unknown. 

The encoding and decoding schemes used in proving our results have a number of practical limitations, 
including the memory requirements for storing codebooks consisting of concatenated code-trees at both the 
transmitter and receiver as well as the complexity involved in merging the maximum-likelihood decoders 
tuned to a number of channels that is polynomial in the blocklength. As such, our work motivates a search 
for more practical schemes for feedback communication over the compound channel with memory. 

Appendix I 
Proof of Proposition 1

The proposition is nearly identical to [4, Proposition 1] except that we replace $I(X^n; Y^n | s_0, \theta)$ by $I(X^n \to Y^n | s_0, \theta)$ and $Q(x^n)$ by $Q(x^n \| z^{n-1})$ using results from [14] on directed mutual information and causal conditioning. We first prove the following lemma, which is needed in the proof of Proposition 1. The lemma shows that directed information is uniformly continuous in $Q_{X^n \| Y^{n-1}}$. For our time-invariant deterministic feedback model, $Q(x^n \| y^{n-1}) = Q(x^n \| z^{n-1})$, and the lemma holds for any such feedback.

Lemma 14: (Uniform continuity of directed information) If $Q^1_{X^n \| Y^{n-1}}$ and $Q^2_{X^n \| Y^{n-1}}$ are two causal conditioning distributions such that

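$$\left\| Q^1_{X^n \| Y^{n-1}} - Q^2_{X^n \| Y^{n-1}} \right\|_1 \le \Delta \le \frac{1}{2},$$ (84)

then for a fixed $P_{Y^n \| X^n}$

$$\left| \mathcal{I}(Q^1_{X^n \| Y^{n-1}}; P_{Y^n \| X^n}) - \mathcal{I}(Q^2_{X^n \| Y^{n-1}}; P_{Y^n \| X^n}) \right| \le -\Delta \log \frac{\Delta}{|\mathcal{Y}^n|^2}.$$ (85)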



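Proof: Directed information can be expressed as a difference between two terms, $I(X^n \to Y^n) = H(Y^n) - H(Y^n \| X^n)$. Let us consider the total variation of $P^1_{Y^n}(\cdot) - P^2_{Y^n}(\cdot)$:

$$\begin{aligned} \sum_{y^n} |P^1(y^n) - P^2(y^n)| &= \sum_{y^n} \Big| \sum_{x^n} \big( P^1(x^n, y^n) - P^2(x^n, y^n) \big) \Big| \\ &\le \sum_{y^n} \sum_{x^n} |P^1(x^n, y^n) - P^2(x^n, y^n)| \\ &\le \Delta \end{aligned}$$ (86)

By invoking the continuity lemma of entropy [22, Theorem 2.7, p33] we get

$$|H^1(Y^n) - H^2(Y^n)| \le -\Delta \log \frac{\Delta}{|\mathcal{Y}^n|},$$ (87)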



where $H^1(Y^n)$ and $H^2(Y^n)$ are the entropies induced by $P^1_{Y^n}(\cdot)$ and $P^2_{Y^n}(\cdot)$, respectively. Now let us consider the difference $H^1(Y^n \| X^n) - H^2(Y^n \| X^n)$.

|iJ^(F"||X") -ii-2(y"||X")| 

^ -pi(a;",y")logP(y"||a;") + p2(^n y-)logP(y-||a;") 

5^ -P(y"||a;")gi(x"||y"-i)logP(y"||x-)+P(y-||x")Q2(;^n||y-i)logP(y-||a;-) 
5^ -P(y"||x-)logP(y"||x")(Qi(x-||y"-i)-Q2(ar"||y"-i)) 
^ -P(y"||x-)logP(j/"||x")|Qi(x"||y-i)-Q2(a;-||y"-i)| 

< ( E -Wlk")logP(2/"||x")) ( ^ \QHx-\\y--')-Q\x-\\y--')\\ 
$$\le \Delta \log |\mathcal{Y}^n|$$ (88)

By combining inequalities (87) and (88) we conclude the proof of the lemma. ■

By Lemma 14, $I(X^n \to Y^n | s_0, \theta)$ is uniformly continuous in $Q_{X^n \| Z^{n-1}}$. Since $Q_{X^n \| Z^{n-1}}$ is a member of a compact set, the maximum over $Q_{X^n \| Z^{n-1}}$ is attained and $C_n$ is well-defined.

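Next, we invoke a result similar to [4, Lemma 5]. Given integers $k$ and $m$ such that $k + m = n$, input sequences $x_1^k = (x_1, \ldots, x_k)$ and $x_{k+1}^n = (x_{k+1}, \ldots, x_n)$ with corresponding output sequences $y_1^k$ and $y_{k+1}^n$, let $Q(x^n \| z^{n-1})$ be defined as

$$Q(x^n \| z^{n-1}) = Q(x_1^k \| z_1^{k-1})\, Q(x_{k+1}^n \| z_{k+1}^{n-1}).$$

Then

$$\inf_{s_0, \theta} I(X^n \to Y^n | s_0, \theta) \ge \inf_{s_0, \theta} I(X_1^k \to Y_1^k | s_0, \theta) + \inf_{s_k, \theta} I(X_{k+1}^n \to Y_{k+1}^n | s_k, \theta) - \log |\mathcal{S}|.$$

This result follows from [4, Lemma 5] and [14, Lemma 5].

Finally, if we let $Q(x_1^k \| z_1^{k-1})$ and $Q(x_{k+1}^n \| z_{k+1}^{n-1})$ achieve the maximizations in $C_k$ and $C_m$, respectively, then we have

$$\begin{aligned} n C_n &\ge \inf_{s_0, \theta} I(X^n \to Y^n | s_0, \theta) \\ &\ge \inf_{s_0, \theta} I(X_1^k \to Y_1^k | s_0, \theta) + \inf_{s_k, \theta} I(X_{k+1}^n \to Y_{k+1}^n | s_k, \theta) - \log |\mathcal{S}| \\ &= k C_k + m C_m - \log |\mathcal{S}| \end{aligned}$$

or equivalently,

$$n \left[ C_n - \frac{\log |\mathcal{S}|}{n} \right] \ge k \left[ C_k - \frac{\log |\mathcal{S}|}{k} \right] + m \left[ C_m - \frac{\log |\mathcal{S}|}{m} \right].$$

Clearly $\lim_{n \to \infty} \left[ C_n - \frac{\log |\mathcal{S}|}{n} \right] = \lim_{n \to \infty} C_n$, and by the convergence of a super-additive sequence, $\lim_{n \to \infty} C_n = \sup_n \left[ C_n - \frac{\log |\mathcal{S}|}{n} \right]$.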

Appendix II 
Proof of Theorem 2

The theorem is proved through a collection of results in [4] and [14]. Let $P^n_{e,w}(\theta)$ denote the error probability of the ML decoder when a random code-tree of blocklength $n$ is used at the encoder.

$$P^n_{e,w}(\theta) = \sum_{y^n :\, \phi_\theta(y^n) \neq w} P(y^n | x^n(w, z^{n-1}), \theta)$$ (89)

The following corollary to [14, Theorem 8] bounds the expected value $E[P^n_{e,w}(\theta)]$, where the expectation is with respect to the randomness in the code. The result holds for any initial state $s_0$.

Corollary 15: Suppose that an arbitrary message $w$, $1 \le w \le e^{nR}$, enters the encoder with feedback and that ML decoding tuned to $\theta$ is employed. Then the average probability of decoding error over the ensemble of codes is bounded, for any choice of $\rho$, $0 \le \rho \le 1$, by



nR 



(90) 



Proof: Identical to [14, Proof of Theorem 8] except that $P(y^n \| x^n)$ is replaced by $P(y^n \| x^n, \theta)$. ■

Next, we let $P_e^n(s_0, \theta)$ denote the average (over messages) error probability incurred when a code-tree of blocklength $n$ is used over channel $\theta$ with initial state $s_0$. Using Corollary 15, we can bound $P_e^n(s_0, \theta)$ as in the following corollary to [14, Theorem 9].

Corollary 16: For a compound FSC with $|\mathcal{S}|$ states where the codewords are drawn independently according to a given distribution $Q_n \in \mathcal{P}(\mathcal{X}^n \| \mathcal{Z}^{n-1})$ and ML decoding tuned to $\theta$ is employed, the average probability of error $P_e^n(s_0, \theta)$ for any initial state $s_0 \in \mathcal{S}$, channel $\theta \in \Theta$, and $\rho$, $0 \le \rho \le 1$, is bounded as

$$P_e^n(s_0, \theta) \le |\mathcal{S}| \exp\big( -n (F^n(\rho, Q_n, \theta) - \rho R) \big)$$ (91)



where

$$F^n(\rho, Q_n, \theta) = -\frac{\rho \log |\mathcal{S}|}{n} + \min_{s_0} E_0(\rho, Q_n, s_0, \theta)$$ (92)



EQ{p,Qn,SQ,9) 

Proof: Identical to [14, Proof of Theorem 9] except for: (i) we replace $P(y^n \| x^n, s_0)$ by $P(y^n \| x^n, s_0, \theta)$, (ii) we consider the error averaged over all messages (rather than the error for an arbitrary message $w$), and (iii) we assume a fixed input distribution $Q_{X^n \| Z^{n-1}}$ rather than minimizing the error probability over all input distributions. ■

The two results stated above provide us with a bound on the error probability; however, the bound depends on the channel $\theta$ in use. Instead, we would like to bound the error probability uniformly over the class $\Theta$. To do so we cite the following two lemmas from previous work.

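Lemma 17: Given $Q_k \in \mathcal{P}(\mathcal{X}^k \| \mathcal{Z}^{k-1})$ and $Q_m \in \mathcal{P}(\mathcal{X}^m \| \mathcal{Z}^{m-1})$, let $m = n - k$ and define

$$Q_n(x_1^n \| z_1^{n-1}) = Q_k(x_1^k \| z_1^{k-1})\, Q_m(x_{k+1}^n \| z_{k+1}^{n-1}).$$ (93)

Then $F^n(\rho, Q_n, \theta)$ as defined in Corollary 16 satisfies

$$F^n(\rho, Q_n, \theta) \ge \frac{k}{n} F^k(\rho, Q_k, \theta) + \frac{m}{n} F^m(\rho, Q_m, \theta).$$ (94)

Proof: Identical to [14, Proof of Lemma 11] except that we replace $P(y^n \| x^n, s_0)$ by $P(y^n \| x^n, s_0, \theta)$. ■

Lemma 18:

$$E_0(\rho, Q_n, s_0, \theta) \ge \frac{\rho}{n} \mathcal{I}(Q_n, P_{Y^n \| X^n, s_0, \theta}) - \frac{\rho^2}{2n} \big( \log(e |\mathcal{Y}^n|) \big)^2$$ (95)

Proof: The lemma follows from [4, Lemma 2], which holds for a channel $P$ and input distribution $Q$ satisfying $\sum_{x^n} Q(x^n \| z^{n-1}) = 1$ and $\sum_{y^n} P(y^n \| x^n) = 1$. ■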

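We now follow the technique in [4] by using Lemmas 17 and 18 to bound the error probability independent of both $s_0$ and $\theta$. For a given rate $R < C$, let $\epsilon = (C - R)/2$ and pick $m$ in such a way that $C_m - \frac{\log |\mathcal{S}|}{m} \ge R + \epsilon$. Then

$$\max_{Q_{X^m \| Z^{m-1}}} \inf_{s_0, \theta} \frac{1}{m} \mathcal{I}(Q_{X^m \| Z^{m-1}}; P_{Y^m \| X^m, s_0, \theta}) - \frac{\log |\mathcal{S}|}{m} \ge R + \epsilon.$$ (96)

Let $Q_m^* \in \mathcal{P}(\mathcal{X}^m \| \mathcal{Z}^{m-1})$ be the input distribution that achieves the maximum in $C_m$, i.e.,

$$\inf_{s_0, \theta} \frac{1}{m} \mathcal{I}(Q_m^*, P_{Y^m \| X^m, s_0, \theta}) - \frac{\log |\mathcal{S}|}{m} \ge R + \epsilon.$$ (97)

Next, we use $Q_m^*$ to define a distribution $Q_{Nm} \in \mathcal{P}(\mathcal{X}^{Nm} \| \mathcal{Z}^{Nm-1})$ for a sequence of length $Nm$, $N \ge 1$, as follows.

$$Q_{Nm}(x^{Nm} \| z^{Nm-1}) = \prod_{j=1}^{N} Q_m^*\big( x_{(j-1)m+1}^{jm} \,\|\, z_{(j-1)m+1}^{jm-1} \big)$$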

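For this new input distribution and sequence of length $Nm$, we can bound the error exponent

$$F^{Nm}(\rho, Q_{Nm}, \theta) - \rho R$$ (100)

as shown below.

$$\begin{aligned} F^{Nm}(\rho, Q_{Nm}, \theta) - \rho R &\overset{(a)}{\ge} F^m(\rho, Q_m^*, \theta) - \rho R && (101) \\ &= \min_{s_0} E_0(\rho, Q_m^*, s_0, \theta) - \rho \left( R + \frac{\log |\mathcal{S}|}{m} \right) && (102) \\ &\overset{(b)}{\ge} \min_{s_0} \frac{1}{m} \rho\, \mathcal{I}(Q_m^*; P_{Y^m \| X^m, s_0, \theta}) - \frac{1}{2m} \rho^2 \big( \log(e |\mathcal{Y}^m|) \big)^2 - \rho \left( R + \frac{\log |\mathcal{S}|}{m} \right) && (103) \\ &= \rho \left( \min_{s_0} \frac{1}{m} \mathcal{I}(Q_m^*; P_{Y^m \| X^m, s_0, \theta}) - \frac{\log |\mathcal{S}|}{m} - R \right) - \frac{1}{2m} \rho^2 \big( \log(e |\mathcal{Y}^m|) \big)^2 && (104) \\ &\overset{(c)}{\ge} \rho \epsilon - \frac{1}{2m} \rho^2 \big( \log(e |\mathcal{Y}^m|) \big)^2 && (105) \end{aligned}$$

where (a) is due to Lemma 17, (b) follows from Lemma 18, and (c) follows from (97). As in [4], we can maximize the lower bound on the error exponent by setting $\rho = \min\big(1, m\epsilon / (\log(e |\mathcal{Y}^m|))^2\big)$. With this choice of $\rho$ we have

$$F^{Nm}(\rho, Q_{Nm}, \theta) - \rho R \ge \begin{cases} \dfrac{m \epsilon^2}{2 \big( \log(e |\mathcal{Y}^m|) \big)^2}, & \epsilon < \frac{1}{m} \big( \log(e |\mathcal{Y}^m|) \big)^2 \\ \epsilon - \dfrac{1}{2m} \big( \log(e |\mathcal{Y}^m|) \big)^2, & \text{otherwise.} \end{cases}$$ (106)

Theorem 2 follows by combining (106) with the result in Corollary 16 (for blocklength $Nm$).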


Appendix III 
Proof of Lemma 5

To prove the lemma, we must first establish two equalities relating the channel causal conditioning distribution $P(y^n \| x^n, s_0, \theta)$ to the channel probability law $P(y_i, s_i | x_i, s_{i-1}, \theta)$. The following set of equalities hold.

$$\begin{aligned} P(y^n, x^n | s_0, \theta) &= \sum_{s^n} P(y^n, x^n, s^n | s_0, \theta) && (107) \\ &\overset{(a)}{=} \sum_{s^n} P(x^n \| y^{n-1}, s^{n-1}, s_0, \theta)\, P(y^n, s^n \| x^n, s_0, \theta) && (108) \\ &\overset{(b)}{=} \sum_{s^n} P(x^n \| y^{n-1}, s_0, \theta)\, P(y^n, s^n \| x^n, s_0, \theta) && (109) \\ &= P(x^n \| y^{n-1}, s_0, \theta) \sum_{s^n} P(y^n, s^n \| x^n, s_0, \theta) && (110) \end{aligned}$$

where (a) is due to [14, Lemma 2] and (b) follows from our assumption that the input distribution does not depend on the state sequence $s^{n-1}$. By the chain rule for causal conditioning [14, Lemma 1], (110) implies that

$$P(y^n \| x^n, s_0, \theta) = \sum_{s^n} P(y^n, s^n \| x^n, s_0, \theta).$$ (111)

Also,

$$\begin{aligned} P(y^n, s^n \| x^n, s_0, \theta) &= \prod_{i=1}^{n} P(y_i, s_i | x^i, y^{i-1}, s^{i-1}, s_0, \theta) && (112) \\ &\overset{(c)}{=} \prod_{i=1}^{n} P(y_i, s_i | x_i, s_{i-1}, \theta) && (113) \end{aligned}$$

where (c) follows from the definition of the compound finite-state channel. Having established equations (111) and (113), Lemma 5 follows immediately from [9, Lemma 12], where the conditional probability $P(y_i, s_i | x_i, s_{i-1}, \theta)$ is quantized and the quantization cells are represented by channels $\{\theta_1^{(n)}, \ldots, \theta_{K(n)}^{(n)}\}$. The proof of our result differs only in that the upper bound on the error exponents in the family is given by $\mu = 1 + \log |\mathcal{Y}|$.

Appendix IV 
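Proof of Lemmas 7, 8 and 9

The proof of Lemma 7 is based on an identity that is given by Kim in [15, eq. (9)]:

$$I(X^n \to Y^n) = \sum_{i=1}^{n} I(X_i; Y_i^n | X^{i-1}, Y^{i-1})$$ (114)

Proof of Lemma 7: Using Kim's identity we have

$$\begin{aligned} I(X^n \to Y^n) &= \sum_{i=1}^{n} I(X_i; Y_i^n | X^{i-1}, Y^{i-1}) \\ &= \sum_{i=1}^{k} I(X_i; Y_i^n | X^{i-1}, Y^{i-1}) + \sum_{i=k+1}^{n} I(X_i; Y_i^n | X^{i-1}, Y^{i-1}) \\ &\ge \sum_{i=1}^{k} I(X_i; Y_i^k | X^{i-1}, Y^{i-1}) + \sum_{i=k+1}^{n} I(X_i; Y_i^n | X^{i-1}, Y^{i-1}) \\ &= I(X^k \to Y^k) + \sum_{i=k+1}^{n} I(X_i; Y_i^n | X^{i-1}, Y^{i-1}). \end{aligned}$$ (115)

Now we bound the sum in the last equality,

$$\begin{aligned} \sum_{i=k+1}^{n} I(X_i; Y_i^n | X^{i-1}, Y^{i-1}) &= \sum_{i=k+1}^{n} H(X_i | X^{i-1}, Y^{i-1}) - H(X_i | X^{i-1}, Y^{i-1}, Y_i^n) \\ &\overset{(a)}{=} \sum_{i=k+1}^{n} H(X_i | X_{k+1}^{i-1}, Y_{k+1}^{i-1}) - H(X_i | X^{i-1}, Y^{i-1}, Y_i^n) \\ &\ge \sum_{i=k+1}^{n} H(X_i | X_{k+1}^{i-1}, Y_{k+1}^{i-1}) - H(X_i | X_{k+1}^{i-1}, Y_{k+1}^{i-1}, Y_i^n) \\ &= I(X_{k+1}^n \to Y_{k+1}^n) \end{aligned}$$ (116)

where (a) follows from the assumption that $Q(x^n \| z^{n-1}) = Q(x_1^k \| z_1^{k-1})\, Q(x_{k+1}^n \| z_{k+1}^{n-1})$. ■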
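Proof of Lemma 8: The proof consists of two parts. In the first part we show that $n C_n^{\text{Markovian}}$ is sup-additive and therefore $\lim_{n \to \infty} C_n^{\text{Markovian}} = \sup_n C_n^{\text{Markovian}}$. In the second part we prove the capacity of the family of stationary and uniformly ergodic Markovian channels by showing that

$$\lim_{n \to \infty} C_n = \lim_{n \to \infty} C_n^{\text{Markovian}},$$ (117)

where $C_n$ is defined in (11).

First part: We show that the sequence $n C_n^{\text{Markovian}}$ is sup-additive and therefore the limit exists. Let integers $k$ and $m$ be such that $k + m = n$ and denote the input distributions $Q(x^n \| z^{n-1})$, $Q(x_1^k \| z_1^{k-1})$, and $Q(x_{k+1}^n \| z_{k+1}^{n-1})$ in shortened forms as $Q_n$, $Q_k$, and $Q_m$. We have

$$\begin{aligned} n C_n^{\text{Markovian}} &= \max_{Q_n} \inf_\theta I(X^n \to Y^n | \theta) \\ &\overset{(a)}{\ge} \max_{Q_k Q_m} \inf_\theta I(X^n \to Y^n | \theta) \\ &\overset{(b)}{\ge} \max_{Q_k Q_m} \inf_\theta \left[ I(X_1^k \to Y_1^k | \theta) + I(X_{k+1}^n \to Y_{k+1}^n | \theta) \right] \\ &\ge \max_{Q_k Q_m} \left[ \inf_\theta I(X_1^k \to Y_1^k | \theta) + \inf_\theta I(X_{k+1}^n \to Y_{k+1}^n | \theta) \right] \\ &= \max_{Q_k} \inf_\theta I(X_1^k \to Y_1^k | \theta) + \max_{Q_m} \inf_\theta I(X_{k+1}^n \to Y_{k+1}^n | \theta) \\ &\overset{(c)}{=} \max_{Q_k} \inf_\theta I(X^k \to Y^k | \theta) + \max_{Q_m} \inf_\theta I(X^m \to Y^m | \theta) \\ &= k C_k^{\text{Markovian}} + m C_m^{\text{Markovian}} \end{aligned}$$ (118)

where (a) follows by restricting the maximization to causal conditioning probabilities of the product form $Q(x^n \| z^{n-1}) = Q(x_1^k \| z_1^{k-1})\, Q(x_{k+1}^n \| z_{k+1}^{n-1})$, (b) follows from Lemma 7, and (c) follows from stationarity of the channel.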

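Second part: We show that $\lim_{n \to \infty} C_n = \lim_{n \to \infty} C_n^{\text{Markovian}}$. Due to Lemma 5 in [14], $|I(X^n \to Y^n | \theta) - I(X^n \to Y^n | S_0, \theta)| \le \log |\mathcal{S}|$, therefore it is enough to prove that

$$\lim_{n \to \infty} \frac{1}{n} \left[ \max_{Q_{X^n \| Y^{n-1}}} \inf_\theta I(X^n \to Y^n | S_0, \theta) - \max_{Q_{X^n \| Y^{n-1}}} \inf_{s_0, \theta} I(X^n \to Y^n | s_0, \theta) \right] = 0.$$ (119)

The difference in (119) is always nonnegative, hence it is enough to upper bound it by an expression that goes to zero as $n \to \infty$. Again by Lemma 5 in [14] we can bound the second term in (119).

$$\begin{aligned} \max_{Q_{X^n \| Y^{n-1}}} \inf_{s_0, \theta} I(X^n \to Y^n | s_0, \theta) &\ge \max_{Q_{X^n \| Y^{n-1}}} \inf_{s_0, \theta} I(X^n \to Y^n | S_k, s_0, \theta) - \log |\mathcal{S}| \\ &\overset{(a)}{\ge} \max_{Q_{X^n \| Y^{n-1}}} \inf_{s_0, \theta} I(X_{k+1}^n \to Y_{k+1}^n | S_k, s_0, \theta) - \log |\mathcal{S}| \\ &\overset{(b)}{=} \max_{Q_{X^{n-k} \| Y^{n-k-1}}} \inf_{s_{-k}, \theta} I(X^{n-k} \to Y^{n-k} | S_0, s_{-k}, \theta) - \log |\mathcal{S}|, \end{aligned}$$ (120)

where (a) holds for every $k \ge 1$ and is due to Lemma 7 and (b) holds by the stationarity of the channel. Hence, (120) implies that we can bound the difference,

$$\begin{aligned} &\max_{Q_{X^n \| Y^{n-1}}} \inf_\theta I(X^n \to Y^n | S_0, \theta) - \max_{Q_{X^n \| Y^{n-1}}} \inf_{s_0, \theta} I(X^n \to Y^n | s_0, \theta) \\ &\quad \overset{(a)}{\le} \Big( k \log |\mathcal{Y}| + \max_{Q_{X^{n-k} \| Y^{n-k-1}}} \inf_\theta I(X^{n-k} \to Y^{n-k} | S_0, \theta) \Big) - \Big( \max_{Q_{X^{n-k} \| Y^{n-k-1}}} \inf_{s_{-k}, \theta} I(X^{n-k} \to Y^{n-k} | S_0, s_{-k}, \theta) - \log |\mathcal{S}| \Big) \\ &\quad \overset{(b)}{\le} k \log |\mathcal{Y}| + \epsilon (n - k) \log |\mathcal{Y}| + \log |\mathcal{S}|. \end{aligned}$$ (121)

Inequality (a) is due to the fact that $I(X^n \to Y^n) \le k \log |\mathcal{Y}| + I(X^{n-k} \to Y^{n-k})$ and due to (120). Inequality (b) holds since for a uniformly ergodic family of channels, $|P(s_0 | s_{-k}, \theta) - P(s_0 | \theta)| \le \epsilon$ for all $s_0 \in \mathcal{S}$ implies that for any input distribution and any channel $\theta$,

$$|I(X^{n-k} \to Y^{n-k} | S_0, \theta) - I(X^{n-k} \to Y^{n-k} | S_0, s_{-k}, \theta)| \le \epsilon (n - k) \log |\mathcal{Y}|.$$

After dividing (121) by $n$, and since $\epsilon$ can be arbitrarily small and $k$ is fixed for a given $\epsilon$, (119) holds.

■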

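Proof of Lemma 9: From the assumption of the lemma we have

$$I(X^n \to Y^n) = \mathbf{E}\left[ \log \frac{P(Y^n \| X^n)}{P(Y^n)} \right] = 0.$$ (122)

By assuming a uniform input distribution, $Q(x^n) = \frac{1}{|\mathcal{X}|^n}$, and by using the fact that if the Kullback-Leibler divergence $D(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$ is zero, then $p(x) = q(x)$ for all $x \in \mathcal{X}$, we get that (122) implies that $P(y^n \| x^n) = P(y^n)$ for all $x^n \in \mathcal{X}^n$, $y^n \in \mathcal{Y}^n$. It follows that

$$\begin{aligned} \max_{Q_{X^n \| Y^{n-1}}} I(X^n \to Y^n) &= \max_{Q_{X^n \| Y^{n-1}}} \mathbf{E}\left[ \log \frac{P(Y^n \| X^n)}{P(Y^n)} \right] && (123) \\ &= \max_{Q_{X^n \| Y^{n-1}}} \mathbf{E}[0] = 0. && (124) \end{aligned}$$

■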

Appendix V 
Proof of Lemmas 12 and 13

Proof of Lemma 12: The proof is based on the fact that $\mathcal{I}(Q_X, P_{Y|X})$ is uniformly continuous in $P_{Y|X}$, namely for any $Q_X$,

$$|\mathcal{I}(Q_X, P_{Y|X,\theta_1}) - \mathcal{I}(Q_X, P_{Y|X,\theta_2})| \le \tau(\Delta),$$ (125)

where $\tau(\Delta) \to 0$ as $\Delta \to 0$. (The uniform continuity of mutual information is a straightforward result of the uniform continuity of entropy [22, Theorem 2.7].) We have

$$\begin{aligned} |\mathcal{I}(Q_X^1, P_{Y|X,\theta_1}) - \mathcal{I}(Q_X^2, P_{Y|X,\theta_2})| &= |\mathcal{I}(Q_X^1, P_{Y|X,\theta_1}) - \mathcal{I}(Q_X^1, P_{Y|X,\theta_2}) + \mathcal{I}(Q_X^1, P_{Y|X,\theta_2}) - \mathcal{I}(Q_X^2, P_{Y|X,\theta_2})| \\ &\le \tau(\Delta) + |\mathcal{I}(Q_X^1, P_{Y|X,\theta_2}) - \mathcal{I}(Q_X^2, P_{Y|X,\theta_2})| \end{aligned}$$ (126)

where the last inequality is due to (125). We conclude the proof by bounding the last term in (126) by $\tau(\Delta)$, which implies that if we let $\eta(\Delta) = 2\tau(\Delta)$ then (77) holds.

< l{Q\,PY\x,e.)-I{Q\.PY\x,e,) 

< r(A). (127) 
Similarly, we have 1{Q]^, Py\x,0i) — '^{Q\iPY\x,e2) ^ ''"(^)' therefore 

\AQI.Py\xm.)-AQ]c.Py\xm:)\ < r{A). (128) 

■ 

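Proof of Lemma 13: The channel $P_{Y|X,\hat{\theta}}$ is chosen by finding the conditional empirical distribution induced by an input sequence consisting of $\frac{M}{|\mathcal{X}|}$ copies of each symbol of the alphabet $\mathcal{X}$. We estimate the conditional distribution $P_{Y|a}$ separately for each $a \in \mathcal{X}$. We insert $x = a$ for $m = \frac{M}{|\mathcal{X}|}$ uses of the channel and we estimate the channel distribution when the input is $x = a$ as the type of the output, which is denoted as $P_{Y^m|a}$. From Sanov's theorem (cf. [23, Theorem 12.4.1]) we have that the probability that the type $P_{Y^m|a}$ will be at $L_1$ distance larger than $\epsilon_1 = \frac{\epsilon}{|\mathcal{X}|}$ from $P_{Y|a}$ is upper bounded by

$$\Pr\{ \|P_{Y^m|a} - P_{Y|a}\|_1 > \epsilon_1 \} \le (m + 1)^{|\mathcal{Y}|} \exp\Big( -m \min_{P_Y :\, \|P_Y - P_{Y|a}\|_1 > \epsilon_1} D(P_Y \| P_{Y|a}) \Big),$$ (129)

where $D(P_Y \| P_{Y|a}) = \sum_{y \in \mathcal{Y}} P_Y(y) \log \frac{P_Y(y)}{P_{Y|a}(y|a)}$ denotes the divergence between the two distributions. Using Pinsker's inequality [23, Lemma 12.6.1] we have that

$$\min_{P_Y :\, \|P_Y - P_{Y|a}\|_1 > \epsilon_1} D(P_Y \| P_{Y|a}) \ge \frac{\epsilon_1^2}{2}$$ (130)

and therefore,

$$\Pr\{ \|P_{Y^m|a} - P_{Y|a}\|_1 > \epsilon_1 \} \le (m + 1)^{|\mathcal{Y}|} \exp\Big( -m \frac{\epsilon_1^2}{2} \Big).$$ (131)

The term $(m + 1)^{|\mathcal{Y}|} \exp(-m \frac{\epsilon_1^2}{2})$ goes to zero as $m$ goes to infinity for $\epsilon_1 > 0$ and therefore, for any $\delta > 0$ we can find an $m$ such that $(m + 1)^{|\mathcal{Y}|} \exp(-m \frac{\epsilon_1^2}{2}) < \frac{\delta}{|\mathcal{X}|}$. Finally we have,

$$\Pr\{\Delta > \epsilon\} \le \Pr\Big\{ \bigcup_{a \in \mathcal{X}} \Big\{ \|P_{Y^m|a} - P_{Y|a}\|_1 > \frac{\epsilon}{|\mathcal{X}|} \Big\} \Big\} < |\mathcal{X}| \frac{\delta}{|\mathcal{X}|} = \delta$$ (132)

where the inequality on the right is due to the union bound. ■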
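To get a feel for how long the training phase in this argument has to be, the bound can be evaluated numerically. The sketch below assumes the bound takes the form $(m+1)^{|\mathcal{Y}|} \exp(-m \epsilon_1^2 / 2)$ with $\epsilon_1 = \epsilon / |\mathcal{X}|$ and natural logarithms, and it searches for an $m$ at which this bound drops below $\delta / |\mathcal{X}|$. The function names are hypothetical, and the search works in the log domain to avoid overflow.

```python
import math


def log_error_bound(m, eps1, y_size):
    """Logarithm of (m + 1)^{|Y|} * exp(-m * eps1^2 / 2)."""
    return y_size * math.log(m + 1) - m * eps1 ** 2 / 2


def training_uses_per_symbol(eps, delta, x_size, y_size):
    """Return an m for which the bound is below delta / |X|.

    The log-bound is concave in m and decreasing beyond m = 2|Y| / eps1^2,
    so a doubling search followed by bisection is sufficient there.
    """
    eps1 = eps / x_size
    log_target = math.log(delta / x_size)
    lo = max(1, math.ceil(2 * y_size / eps1 ** 2))
    hi = lo
    while log_error_bound(hi, eps1, y_size) >= log_target:
        hi *= 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if log_error_bound(mid, eps1, y_size) >= log_target:
            lo = mid
        else:
            hi = mid
    return hi


# Example with binary input and output alphabets (values are arbitrary).
m_example = training_uses_per_symbol(eps=0.5, delta=0.1, x_size=2, y_size=2)
total_training_uses = 2 * m_example   # M = m * |X|
```

The resulting $m$ grows roughly like $1/\epsilon_1^2$ up to logarithmic factors, but it does not depend on the blocklength $n$, which is why the training overhead vanishes as $n$ grows.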